How good are you, Turker?
One common question when working with Mechanical Turk is "How good are the Turkers? Can I trust their answers?" In a previous post I gave some pointers to the existing literature on estimating Turker quality based on the returned responses and Bob Carpenter has also developed an excellent Bayesian framework for the same task.
This whole line of work assumes that all we have available are the Turkers' responses for the task at hand, and possibly for previous tasks as well.
An alternative direction is to examine whether Turkers can self-report their own quality. To examine whether this direction is promising, we ran the following experiment on Amazon Mechanical Turk: We picked 1000 movie reviews from the sentiment analysis data set collected by Pang and Lee and posted them on Amazon Mechanical Turk.
We asked the participants on Mechanical Turk to read the text of a movie review, and estimate the star rating (from 0.1 to 0.9) that the movie critic assigned to the movie. We also asked users to self-report how difficult it was to estimate the rating, giving a difficulty rating of 0 to the easiest, and a rating of 4 to the most difficult.
Our first results were encouraging: There is a significant correlation between the "true" rating, assigned by the author of the review (not visible to the Mechanical Turk workers), and the average rating assigned by the labelers. Across the full dataset, the correlation was approximately 0.7, indicating that Mechanical Turk workers can recognize sentiment effectively. However, a correlation of 0.7 is far from perfect and indicates that there is a significant amount of noise.
The interesting part, though, is when we break down the responses by self-reported difficulty. The figure below shows the average labeler rating as a function of the true rating, broken down by level of self-reported difficulty ($D=0$ are the easiest, $D\geq 3$ are the hardest). Computing the correlations for the different levels of difficulty, we get a correlation of 0.99 (!) for reported difficulty $D=0$, 0.68 for $D=1$, 0.44 for $D=2$, and just 0.17 for $D \geq 3$.
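For readers who want to try the same breakdown on their own data, here is a minimal sketch of the per-difficulty correlation. It assumes each response is a `(true_rating, worker_rating, difficulty)` tuple and that the correlation is a plain Pearson correlation over individual responses; the post does not say exactly how the numbers above were computed, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlation_by_difficulty(responses, max_bucket=3):
    """Group responses by self-reported difficulty and correlate per group.

    responses: iterable of (true_rating, worker_rating, difficulty).
    Difficulties >= max_bucket are merged into one bucket (D >= 3 in the post).
    Buckets with fewer than two responses are skipped.
    """
    buckets = defaultdict(lambda: ([], []))
    for true_rating, worker_rating, difficulty in responses:
        trues, guesses = buckets[min(difficulty, max_bucket)]
        trues.append(true_rating)
        guesses.append(worker_rating)
    return {
        d: pearson(trues, guesses)
        for d, (trues, guesses) in sorted(buckets.items())
        if len(trues) >= 2
    }
```

Calling `correlation_by_difficulty(responses)` returns a dictionary keyed by difficulty bucket, which can be compared directly against the list of correlations above.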
In other words, Turkers can self-report accurately the difficulty of correctly labeling an example! Since example difficulty and labeling quality are strongly interconnected, this also means that they are good at estimating their own quality! (Puzzled how we can infer worker quality since the workers report example difficulty? Think of a well-prepared student for an exam, and a badly prepared one; the well-prepared student will find an exam to be "easy", while the badly prepared student will find the exam to be "difficult".)
So instead of devising sophisticated algorithms to estimate labeler quality, we can simply ask the Turkers: "How good are you?"

5 comments:
Very interesting, I had not thought of having MTurkers self-report difficulty. However, are you measuring labeler quality, or label quality? It seems more likely that there are movie reviews that are easy to interpret ("It was Spielberg meets Ken Burns"), and those that are more difficult to interpret ("Sam Peckinpah meets Russ Meyer").
I suppose the answer is in whether difficulty is more closely correlated with the movie review or with the labeler: whether certain reviews were easy to label, or whether certain labelers found the reviews easier to label.
In this case, we measure the difficulty of labeling an example.
However, we have seen that labeler quality and label quality are actually strongly interconnected. You can attribute the errors either fully to example difficulty, or fully to labeler quality, or to an infinite number of combinations of the two. We are preparing the paper now, so you will have to take my word for it for now, and I will post later a more detailed and less-cryptic answer :-)
OK, I found the easy explanation:
Everything else being equal, a high-quality labeler tends to mark an example as "not difficult", while a low-quality labeler will mark the same example as having a high degree of difficulty.
It is pretty similar to a well prepared student finding an exam to be "easy", while a badly prepared student will find the exam to be "difficult".
In fact, my analysis of the data shows exactly that: low-quality labelers tend to report a higher degree of difficulty than high-quality labelers when labeling the same examples.
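One way to run this kind of check is sketched below. This is not the analysis used for the data above, only an illustration under stated assumptions: each response is a hypothetical `(worker_id, example_id, abs_error, difficulty)` tuple, labeler quality is proxied by mean absolute rating error, and each reported difficulty is compared against the average difficulty reported for the same example, so that hard examples do not count against the labeler.

```python
from collections import defaultdict


def mean(values):
    """Arithmetic mean of a non-empty collection."""
    values = list(values)
    return sum(values) / len(values)


def worker_profiles(responses):
    """Summarize each worker's error and example-adjusted reported difficulty.

    responses: iterable of (worker_id, example_id, abs_error, difficulty).
    Returns {worker_id: (mean_abs_error, mean_relative_difficulty)}, where
    relative difficulty is the worker's reported difficulty minus the mean
    difficulty that all workers reported for the same example.
    """
    responses = list(responses)

    # Average reported difficulty per example (the "example effect").
    by_example = defaultdict(list)
    for _, example_id, _, difficulty in responses:
        by_example[example_id].append(difficulty)
    example_mean = {e: mean(ds) for e, ds in by_example.items()}

    # Per-worker error and difficulty relative to the example average.
    errors = defaultdict(list)
    relative = defaultdict(list)
    for worker_id, example_id, abs_error, difficulty in responses:
        errors[worker_id].append(abs_error)
        relative[worker_id].append(difficulty - example_mean[example_id])

    return {w: (mean(errors[w]), mean(relative[w])) for w in errors}
```

If the claim holds, workers with a larger mean error should also tend to have a positive mean relative difficulty, meaning they rate the same examples as harder than their peers do.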
Very interesting, thanks for posting this.
Great idea. I wish I'd thought of this, especially as I've seen lots of studies that solicited coders' comments.
1) Reviewers are not consistent in star ratings.
2) Reviewers have different non-linearities (some have 20% 5-star reviews and others 1%)
3) Star ratings are composite scores from aspects such as dialogue, cinematography, acting, etc.
Reasons 1 and 2 explain variance, but item 3 is more fundamental. I think it makes sense to break reviews down into two dimensions, pos and neg. A +pos,-neg review is positive, a -pos,+neg review is negative, a +pos,+neg review is mixed, and a -pos,-neg review is meh. Of course, they could be scaled.
My own experience as a coder is that there's a huge dimension of item difficulty. It's easier to code "Barack Obama" as a person mention, and much trickier to know what to do with "James Bond" (the character), much less the "six Bob Dylans".
The reason it's so nice that their self-assessments are so good is that it's very hard to sort out item difficulty reliably (tight posterior intervals) with only a handful of coders per item.