Wednesday, January 21, 2009

How good are you, Turker?

One common question when working with Mechanical Turk is "How good are the Turkers? Can I trust their answers?" In a previous post I gave some pointers to the existing literature on estimating Turker quality based on the returned responses, and Bob Carpenter has also developed an excellent Bayesian framework for the same task.

This line of work assumes that the only information available is the set of responses that the Turkers submitted for the task at hand, or potentially for previous tasks as well.

An alternative direction is to examine whether Turkers can self-report their own quality. To test whether this direction is promising, we ran the following experiment: we picked 1,000 movie reviews from the sentiment analysis data set collected by Pang and Lee and posted them on Amazon Mechanical Turk.

We asked the workers to read the text of a movie review and estimate the star rating (from 0.1 to 0.9) that the movie critic had assigned to the movie. We also asked them to self-report how difficult it was to estimate the rating, on a scale from 0 (easiest) to 4 (most difficult).
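For concreteness, here is a minimal sketch of how each collected judgment could be represented; the field names below are our own illustration, not part of the original task design.

```python
from dataclasses import dataclass

@dataclass
class TurkerResponse:
    """One worker judgment for one review (field names are illustrative)."""
    review_id: str           # identifier of the Pang and Lee review
    worker_id: str           # anonymized Mechanical Turk worker id
    estimated_rating: float  # worker's guess of the critic's rating, 0.1 to 0.9
    true_rating: float       # rating assigned by the critic (hidden from workers)
    difficulty: int          # self-reported difficulty, 0 (easiest) to 4 (hardest)
```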

Our first results were encouraging: there is a significant correlation between the "true" rating, assigned by the author of each review (and not visible to the Mechanical Turk workers), and the average rating assigned by the labelers. Across the full dataset the correlation was approximately 0.7, indicating that Mechanical Turk workers can recognize sentiment effectively. However, a correlation of 0.7 is far from perfect and indicates a significant amount of noise in the responses.
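As a rough illustration, the overall correlation can be computed along the following lines; this is a sketch assuming the hypothetical TurkerResponse records from the snippet above, not the actual script we used.

```python
from collections import defaultdict
import numpy as np

def overall_correlation(responses):
    """Correlation between true ratings and the average worker rating per review."""
    per_review = defaultdict(list)
    true_rating = {}
    for r in responses:
        per_review[r.review_id].append(r.estimated_rating)
        true_rating[r.review_id] = r.true_rating

    review_ids = sorted(per_review)
    avg_estimates = np.array([np.mean(per_review[rid]) for rid in review_ids])
    truth = np.array([true_rating[rid] for rid in review_ids])

    # Pearson correlation between the critics' ratings and the averaged worker ratings.
    return np.corrcoef(truth, avg_estimates)[0, 1]
```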

The interesting part, though, is what happens when we break down the responses by self-reported difficulty. The figure below shows the average labeler rating, as a function of the true rating, broken down by the different levels of self-reported difficulty ($D=0$ is the easiest, $D\geq 3$ the hardest).

Computing the correlation separately for each level of difficulty, we get a correlation of 0.99 (!) for reported difficulty $D=0$, 0.68 for $D=1$, 0.44 for $D=2$, and just 0.17 when $D \geq 3$.
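The per-difficulty numbers can be obtained by repeating the same computation within each difficulty bucket. The sketch below (reusing the imports from the previous snippet) groups individual responses by self-reported difficulty, lumping everything at 3 or above into one bucket; the exact grouping and averaging choices in our analysis may differ, so treat this as illustrative.

```python
def correlation_by_difficulty(responses):
    """Truth-vs-estimate correlation within each self-reported difficulty level."""
    buckets = defaultdict(lambda: ([], []))  # difficulty -> (true ratings, estimates)
    for r in responses:
        d = min(r.difficulty, 3)             # treat difficulties >= 3 as one group
        truth, estimates = buckets[d]
        truth.append(r.true_rating)
        estimates.append(r.estimated_rating)

    return {d: np.corrcoef(truth, estimates)[0, 1]
            for d, (truth, estimates) in sorted(buckets.items())}
```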

In other words, Turkers can accurately self-report the difficulty of correctly labeling an example! Since example difficulty and labeling quality are strongly interconnected, this also means that they are good at estimating their own quality! (Puzzled about how we can infer worker quality when the workers report example difficulty? Think of a well-prepared student and a badly prepared student taking the same exam: the well-prepared student will find the exam "easy", while the badly prepared one will find it "difficult".)

So, instead of devising sophisticated algorithms to estimate labeler quality, we can simply ask the Turkers: "How good are you?"
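For example, one could imagine down-weighting responses that workers themselves flagged as difficult when aggregating the labels for a review. The weighting scheme in the sketch below is an arbitrary choice for illustration, not something we evaluated in this experiment.

```python
def weighted_rating(review_responses):
    """Aggregate worker estimates for one review, trusting 'easy' judgments more."""
    # Illustrative weights: difficulty 0 -> weight 1.0, difficulty 4 -> weight 0.2.
    weights = [1.0 / (1.0 + r.difficulty) for r in review_responses]
    total = sum(weights)
    return sum(w * r.estimated_rating
               for w, r in zip(weights, review_responses)) / total
```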