Monitoring the Dynamics of Mechanical Turk
Every time that I post a task on Mechanical Turk, I have the same experience: Turkers start working very fast on the posted HITs, and it seems that the task will be completed in a few hours. Then, as time progresses, the rate slows down.
So I have been wondering: why does this happen? Did someone else post many HITs, burying my own? Does it make sense to post my HITs on a specific day or at a specific time, when Turkers work a lot and there are only a few competing HITs?
To answer these questions, I started collecting data from Mechanical Turk, so that I can examine the dynamics of the system. As a first outcome of this effort, I built a small dashboard that shows the number of posted projects, the number of available HITs (a project may have many HITs), and the total amount of rewards available on MTurk.
As a side note, it was a nice opportunity for me to actually write some code. I built the dashboard using the Google Visualization API (pretty cool!). Now I am learning about exporting data sets using the Google Data Source APIs, which will allow for easy embedding of the generated charts and will give third parties direct access to my data.
Since this is the very first version of the dashboard, I welcome any comments. What else would you like to see? Is there something that you do not like? Let me know in the comments!

6 comments:
interesting Panos
/g
I found when I posted the named entity annotation task that I got lots of Turkers trying out one instance and then going away. Some Turkers did a lot of instances, especially the ones who were really sloppy.
With high enough rewards, I'm thinking you'd get lots more people going on to do more HITs of the same type.
What I'm really wondering is whether we could get big tasks done. Could I get all of Wikipedia named-entity annotated? Probably not. We estimate named entity annotation would cost at least 20 cents/400 words, but we found a dropoff in completion rates after a day with 200K words posted in 500 HIT instances. 1G words would cost $500K, and even if we had that kind of annotation budget, I'm thinking we'd run out of Turkers unless we paid a whole lot more.
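For anyone who wants to redo the arithmetic above with different numbers, here is a minimal sketch. It uses only the figures quoted in the comment (20 cents per HIT, 400 words per HIT); the function name and the choice to work in integer cents are my own assumptions.

```python
# Back-of-the-envelope annotation cost, using the figures quoted above.
REWARD_CENTS_PER_HIT = 20   # "at least 20 cents" per HIT
WORDS_PER_HIT = 400         # 200K words posted in 500 HIT instances


def annotation_cost(total_words,
                    words_per_hit=WORDS_PER_HIT,
                    reward_cents=REWARD_CENTS_PER_HIT):
    """Return (number_of_hits, total_cost_in_dollars) for a corpus."""
    # Ceiling division: a partial chunk of text still needs its own HIT.
    hits = -(-total_words // words_per_hit)
    # Work in integer cents and convert to dollars only at the end.
    return hits, hits * reward_cents / 100


pilot_hits, pilot_cost = annotation_cost(200_000)        # the posted task
big_hits, big_cost = annotation_cost(1_000_000_000)      # "1G words"
print(pilot_hits, pilot_cost)
print(big_hits, big_cost)
```

This only covers the rewards themselves; any fees or the cost of redundant annotation would come on top.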
Yeah, I've definitely experienced these fall-offs for I think the same reasons you two describe -- only a small number of Turkers decide to stick around and keep doing the work.
Have also seen others do similar back-of-the-envelope calculations for NE-annotating Wikipedia -- lots of great minds :)
I think Wikipedia NER is easier than newswire or academic paper NER for other reasons, because they have a writing policy of hyperlinking the first mention of an entity. If you think having a Wikipedia page is a good definition of an entity, then being hyperlinked is a really high-precision indicator. Also, you can then get later mentions of that same entity by looking for substring matches to previously hyperlinked terms in the same article.
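A rough sketch of that substring idea, just to make it concrete. This is not anyone's actual system: the function name, the minimum alias length, and the sample sentence are all assumptions for illustration. It takes the anchor texts that were hyperlinked earlier in an article and looks for later whole-word matches to any contiguous piece of them.

```python
import re


def find_later_mentions(text, linked_terms, min_alias_len=4):
    """Return sorted (start, end, linked_term) spans in `text` that match a
    contiguous run of words taken from a previously hyperlinked term."""
    matches = set()
    for term in linked_terms:
        tokens = term.split()
        # Candidate aliases: every contiguous run of tokens in the term,
        # e.g. "John F. Kennedy" -> "John", "F. Kennedy", "Kennedy", ...
        aliases = {" ".join(tokens[i:j])
                   for i in range(len(tokens))
                   for j in range(i + 1, len(tokens) + 1)}
        for alias in aliases:
            if len(alias) < min_alias_len:  # skip initials such as "F."
                continue
            # Match only where the alias is not embedded in a longer word.
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                matches.add((m.start(), m.end(), term))
    return sorted(matches)


text = "Kennedy was elected in 1960. He defeated Nixon."
linked = ["John F. Kennedy", "Richard Nixon"]
for start, end, term in find_later_mentions(text, linked):
    print(text[start:end], "->", term)
```

Note that the pronoun "He" is not picked up, and overlapping aliases are not resolved; exact matching only recovers mentions that literally reuse part of the linked text.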
Microsoft's been working with exactly the idea Brendan suggests:
W. Dakka and S. Cucerzan. 2008.
Augmenting Wikipedia with Named Entity Tags. IJCNLP.
In fact, Dakka co-wrote a paper with Panos. Small world.
Wikipedia may have a policy, but it's hard for writers to follow. If the page doesn't exist, they have to decide whether to link into the aether. And a page that was once unambiguous may become ambiguous.
And of course, the first instance on an entity's own page isn't hyperlinked to itself.
Consider JFK's page:
http://en.wikipedia.org/wiki/John_F._Kennedy
The first instance of "John F. Kennedy" in the body of the article is in the phrase "John F. Kennedy Library", which happens to have its own page, so it's linked, too. As in most news-like writing, after the first mention, a shortened form of the name tends to be used, such as "Kennedy" or even a pronoun, like "he", so it's hard to lean on exact matching to fill in instances. Similarly, the "House Select Committee on Assassinations" is just called "HSCA" after it's introduced.
There's also the vexing issue of adjectives: is "American" an entity, or just "America"?
There are also nested entities, with "president of the United States" linking to the POTUS page; the phrase "United States" is not itself a link (no nested links in today's HTML!).
What is a "project" here? Is that different from a requester?
A project is a "group of HITs": a set of HITs with the same specification, grouped into a single box by MTurk.
A requester can submit many different projects. Each project can have many "HITs Available" and each HIT is paid a particular reward.
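To make the relationship concrete, here is a minimal sketch of how the three dashboard numbers could be computed from one snapshot of projects. The class, field names, and sample values are hypothetical, not the dashboard's actual code or real MTurk data, and it assumes that the total reward for a project is its available HITs multiplied by its per-HIT reward.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """One "group of HITs" as listed on MTurk (field names are assumptions)."""
    requester: str
    title: str
    hits_available: int
    reward_cents: int  # reward paid per HIT, in cents


def dashboard_totals(projects):
    """Aggregate one snapshot into the three numbers the dashboard shows."""
    total_cents = sum(p.hits_available * p.reward_cents for p in projects)
    return {
        "projects": len(projects),
        "hits_available": sum(p.hits_available for p in projects),
        "total_rewards_dollars": total_cents / 100,
    }


# Made-up snapshot: one requester with two projects, another with one.
snapshot = [
    Project("requester_a", "Label images", 500, 5),
    Project("requester_a", "Transcribe audio", 20, 150),
    Project("requester_b", "Annotate named entities", 500, 20),
]
print(dashboard_totals(snapshot))
```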