Friday, May 21, 2010

Prediction Optimizers

The announcement of the Google Prediction API created quite a lot of discussion about the future of predictive modeling. The reactions were mainly positive, but there was one common concern: the Google Prediction API operates as a black box. You give it the data, you train, you get predictions. No selection of the training algorithm, no parameter tuning, no nothing. Black box. Data in, predictions out.

So, the natural question arises: Is it possible to do machine learning like that? Can we trust something if we do not understand the internals of the prediction mechanism?

Declarative query execution and the role of query optimizers

For me, trained as a database researcher, this approach corresponds directly to that of a query optimizer. In relational databases, you upload your data and issue declarative SQL queries to get answers. The internal query optimizer selects how to evaluate the SQL query so that the results come back as fast as possible. Most users of relational databases today do not even know how a query is executed. Is the join executed as a hash join or as a sort-merge join? In which order are the tables joined? How is the GROUP BY aggregation computed? I bet that 99% of the users have no idea, and they do not want to know.
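
To make the contrast concrete, here is a minimal sketch in Python using the standard-library sqlite3 module (the schema is made up for illustration). The query only states what result we want; the only way to find out which physical strategy the engine picked is to ask it to explain its plan.

```python
# Declarative querying in a nutshell: we state *what* we want (a join plus a
# GROUP BY) and the engine's optimizer decides *how* to run it.
# Only the standard library is used; the schema is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

query = """
    SELECT c.country, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.country
"""

# Nothing in the query mentions join order or join algorithm; that is the
# optimizer's business. EXPLAIN QUERY PLAN lets us peek at its choice.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```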

Granted, knowing how a database works can help. If you know that you will mainly perform lookup queries on a given column, you want to build an index. Or create a histogram with the distribution statistics for another column. Or create a materialized view for frequently executed queries. But even these tasks are increasingly automated today. The database tuning advisor in SQL Server routinely suggests indexes and partitioning schemes for my databases that I would never have thought of building. (And I have a PhD in databases!)
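
Here is the same idea as a tiny, self-contained sketch (again sqlite3, invented schema): the identical lookup query, explained before and after creating an index, shows the optimizer silently switching from a full table scan to an index search once the physical design changes.

```python
# Physical design changes what the optimizer does, not what we ask for.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

lookup = "SELECT * FROM orders WHERE customer_id = 42"

# Before the index: the plan is typically a full scan of the table.
print(list(conn.execute("EXPLAIN QUERY PLAN " + lookup)))

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# After the index: the same query, but the plan now searches the index.
print(list(conn.execute("EXPLAIN QUERY PLAN " + lookup)))
```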

Declarative predictive modeling

I see an exact parallel to this approach in the Google Prediction API. You upload the data and Google selects the most promising model for you. (Or, more likely, they build many models and do some form of meta-learning on top of their predictions.) I would call this a "prediction optimizer", by analogy with the query optimizer that is built into every relational database system today. My prediction (pun intended) is that such prediction optimizers will be an integral part of every predictive analytics suite in the future.
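
To make the idea tangible, here is a toy sketch of what a prediction optimizer might do, written with today's scikit-learn API. The internals of the Google Prediction API are not public, so the candidate models, the synthetic dataset, and the selection rule below are all my own assumptions: fit a handful of candidates, score each with cross-validation, and return the winner.

```python
# A toy "prediction optimizer": try several candidate models, score each by
# cross-validation, and keep the best. This is only a sketch of the idea, not
# a description of how any real service works internally.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A synthetic dataset standing in for whatever the user uploads.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# "Upload the data, get back the most promising model."
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(best_name, round(scores[best_name], 3))
```

A real system would of course search a much larger space, and probably combine the candidates with some form of meta-learning rather than pick a single one, but the user-facing contract is the same: data in, predictions out.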

Someone may argue that it would be better to have the ability to hand-tune some parts. For example, if you know something about the structure of the data, you may pass hints to the "prediction optimizer", indicating that a particular learning strategy is better. This has a direct correspondence in the query optimization world: if you know that a particular execution strategy is better, you can pass a HINT to the optimizer as part of the SQL query.
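
In the toy optimizer above, a hint could simply restrict the set of candidate models. The sketch below is entirely hypothetical: the function name, the hint mechanism, and its semantics are made up for illustration.

```python
# A hypothetical "hint" for the toy prediction optimizer: if the caller names a
# candidate model, consider only that one; otherwise search the whole pool.
from sklearn.model_selection import cross_val_score


def train_with_hint(X, y, candidates, hint=None, cv=5):
    """Fit and return the best candidate; `hint` narrows the search to one model."""
    pool = {hint: candidates[hint]} if hint in candidates else candidates
    scores = {name: cross_val_score(model, X, y, cv=cv).mean()
              for name, model in pool.items()}
    best = max(scores, key=scores.get)
    return candidates[best].fit(X, y), scores[best]

# Usage (with the candidates dictionary from the previous sketch):
# model, score = train_with_hint(X, y, candidates, hint="random_forest")
```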

Can we do better? Yes (but 99% of the people do not care)

The obvious question is, can we do better than relying on a prediction optimizer to build a model? The answer is pretty straightforward: Yes!

In fact, if you build a custom solution tailored to your data and workload, it can be significantly faster than any commercial database. Databases carry a lot of extra baggage (e.g., transactions) that is not useful in every application but slows down execution considerably. I will not even get into the discussion of web crawlers, financial trading systems, and so on. However, these custom solutions come at a cost (time, people, money...). Many people just want to store and manage their data. For them, existing database systems and their behind-the-scenes optimizers are good enough!

Similarly, you can expect to see many people building predictive models "blindly" using a black-box approach, relying on a prediction optimizer to handle the details. If people do not care about hash joins vs. sort-merge joins, I do not think anyone will care whether the prediction came from a support vector machine with a radial basis function kernel, from a random forest, or from a Mechanical Turk worker. (Yes, I had to put MTurk in the post.)

The future

I know that predictions, especially about the future, are hard, but here is my take: we are going to see a market for predictive modeling suites, similar to the market for databases. Multiple vendors will build (or have already built) such suites. In the same way that Oracle, Microsoft, IBM, Teradata, and so on compete today for the best SQL engine, we will see competition for turnkey predictive modeling solutions. You upload the data, and then the engines compete on scalability, speed of training, and the best ROC curve.

Let's see: Upload-train-predict. Waiting for an answer...