Thursday, May 20, 2010

Google Prediction API: Commoditization of Large-Scale Machine Learning?

Today Google announced the availability of the Google Prediction API. In brief, it allows users to upload massive datasets into the Google Datastore and then have Google build a supervised machine learning model (aka a classifier) from the data. This is simply big news!

Google seems to promise great simplicity: upload data in CSV format, and Google takes care of the rest. They select an appropriate model for the data, train it, report accuracy statistics, and let you classify new instances. Building classifiers from large-scale datasets becomes trivial.
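For concreteness, here is roughly what that workflow might look like in code. Since I have not had access to the API yet, this is only a minimal sketch against hypothetical REST endpoints: the URLs, parameter names, and response fields below are my assumptions for illustration, not the documented interface.

# Hypothetical sketch of the advertised workflow: (1) upload a CSV of
# labeled examples, (2) let Google pick and train a model, (3) classify
# a new instance. Endpoint paths and field names are assumed, not documented.
import time
import requests

BASE = "https://www.googleapis.com/prediction/v1"   # assumed base URL
AUTH = {"Authorization": "Bearer <oauth-token>"}    # placeholder credential

# (1) Upload training data: first column is the label, the rest are features.
with open("reviews.csv", "rb") as f:
    requests.post(f"{BASE}/training", headers=AUTH,
                  params={"data": "mybucket/reviews"}, data=f)

# (2) Poll until training finishes; the service selects and fits the model.
while True:
    status = requests.get(f"{BASE}/training/mybucket%2Freviews",
                          headers=AUTH).json()
    if status.get("trainingStatus") != "running":    # assumed field name
        break
    time.sleep(30)
print("reported accuracy:", status.get("accuracy"))  # assumed field name

# (3) Classify a new instance.
result = requests.post(f"{BASE}/query/mybucket%2Freviews/predict",
                       headers=AUTH,
                       json={"data": {"input": {"text": ["great product"]}}})
print(result.json())

The striking thing in this sketch is how little is exposed: no model selection, no feature scaling, no parameter tuning. That is exactly what would make this a commodity.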

While I have not had the chance to access the API, this seems to be a game changer. The ability to scale models to massive datasets was beyond the reach of many, and it now suddenly becomes a commodity. Research labs that want to build classifiers as tools (and not as the focus of their research) will be able to do so without much expertise. Similarly, startups will be able to use scalable machine learning infrastructure without having an in-house expert.

In a sense, it seems to bring machine learning to the masses, raising the performance baseline to a very high level. If Google Predict is "good enough", will people seek more advanced solutions? The MySQL optimizer pretty much sucks, but it is "good enough" for many.

Will Google Predict make large-scale machine learning a commodity? Does it mean that the value now lies in having the data and in feature engineering? Unclear, but it is definitely a plausible scenario.

I will withhold further commentary until I manage to get access to the API. But I am excited!