Thursday, February 5, 2009

Your estimated waiting time: Infinite!

In my last post I showed some statistics about the completion time of the tasks submitted on Mechanical Turk. After taking a more careful look, it seems that there is a serious problem with the MTurk marketplace: the average time for completing a task is actually infinite!

OK, got your attention?

Of course, I do not mean that nothing ever gets done on MTurk. I have completed plenty of tasks, and I am pretty sure other people did as well. However, there is a subtle problem indicating that the long term prospects for Mechanical Turk do not look good.

Let me explain my thinking.

By looking at the distribution of completion times, it was easy to visually inspect the count-count plot, and see that it is a power-law distribution. Having a power-law distribution of completion times (aka a "heavy tail" distribution) is by itself a pretty alarming fact. In contrast to the "well-behaving" systems with exponentially-distributed waiting times, a system with a heavy-tail distribution can frequently generate waiting times that are way above the average waiting time.

This is actually not a big problem by itself. Many computer systems tend to have workloads that have heavy tails of completion times, mainly due to the self-similar burstiness of the requests. Actually by looking at the distribution of the requests, it is often possible to infer the average completion time of a task, even under heavy-tail workloads.

So, here is the question of the day: What is the average completion time for a Mechanical Turk task?

The easy answer is to examine the completed past tasks and report the average time from the sample. Unfortunately, this is an incorrect answer: the sample mean and the distribution mean can be two very different values when dealing with power-law distributions.

Let's see how we can estimate the distribution of the task completion time, and (by extension) its mean and variance. We start by getting the count-count plot, which depicts on the x-axis the duration of a task in hours, and on the y-axis the number of tasks that required exactly that many hours for completion.


The first task is to take this (highly noisy) count-count plot, and convert it into the (complementary) CDF plot. We show below the revised plot, where the x-axis is again the duration of the task, and the y-axis is the probability $Pr(duration \geq x)$, i.e., the probability that a randomly picked task lasted longer than x.


The plot is now much clearer and has a power-lae structure for "small" completion times (i.e. less than 120 hours / 5 days). Then we see a drop as we approach longer task durations. This is actually an artifact of the "censoring" in the collected data sample. We have been observing the marketplace for a month now, so the samples does not contain longer completion times.

Now, given these values, how do we estimate the parameters of the power-law distribution and get the average completion time?

Since we have in our sample discrete completion times, the pdf of the power-law distribution is the Zeta distribution, with:

$Pr\{ duration = x \} = x^{-\alpha}/\zeta(\alpha)$

where $\zeta(\alpha) = \sum_{n=1}^\infty \frac{1}{n^\alpha}$ is the Riemann zeta function and $\alpha$ is the parameter of the distribution.

How can we estimate the parameter $\alpha$? As I have blogged previously, fitting a regression line on the plot is a bad idea! Instead it is much better to use the maximum likelihood estimator:

$\frac{\zeta^\prime(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i=1}^{n} \ln(x_i)$

In our case, the $x_i$ values are the observed durations of the tasks.

I keep the task durations in a table Task_Duration(HITid, duration) in a relational database, so computing the value $\frac{1}{n} \sum_{i=1}^n \ln {x_i}$ is a simple query:

SELECT SUM(log(duration)*cnt)/SUM(cnt)
FROM (SELECT duration, COUNT(HITid) AS cnt
FROM Task_Duration GROUP BY duration) AS A

In the case of Mechanical Turk, we have:

$\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})} = -2.3926$

By looking up the table with the values of $\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})}$



we find that the most likely value for $\hat\alpha_{MTurk} = 1.35$.

And now the interesting part! A power-law distribution with $\alpha \leq 2$ does not have a well defined mean value! (It is infinite.) In such cases, the sample mean is never a good representation of the distribution mean. Instead, the mean value for the sample is expected to increase over time, without ever converging to a stable value.

Is this the case for Mechanical Turk? To check whether this is (so far) confirmed by the data collected until today, I plotted the average task duration for all tasks that finished in each date. Here is the alarming image:

In a stable system, you would expect the average completion time to increase initially (due to censoring effects, i.e. not observing long tasks) and then stabilize. However, what we can see is that indeed the average completion time is increasing, reinforcing the evidence that the system has a power-law distribution of completion times, with infinite mean. Not an encouraging sign at all!

I am now curious to investigate further the marketplace in terms of its queueing behavior. What is the arrival process? What is the serving process? What causes the bottlenecks? Is it the type of the tasks (boring, long tasks left to die?) or it is due to the behavior of the Turkers (doing a few HITs and then taking long breaks)?

Anyone more familiar with scheduling and queuing theory, please give pointers to the comments!

4 comments:

Bob Carpenter said...

I think it's simple supply and demand.

Suppose each Turker evaluates jobs and does as many of the HITs as they think are worth it (basing utility on entertainment/boredom and pay). The Turkers have different tastes and tolerances; some like graphical tasks and some like language games. Because the tasks are repetitive and tend to pay a constant rate, a Turker may do 100 HITs but get bored after that and not find the 101st HIT worth doing.

Now let's suppose that when a job's posted, lots of Turkers see it and evaluate it. As it stays up, fewer new Turkers/day evaluate whether to work on it. (Though some might come back over time if they run out of better alternatives.)

Andrew said...

If the task does not get done you can hire somebody else to do it, no? Then the practical waiting time is finite.

Panos Ipeirotis said...

Andrew, even on Mechanical Turk the waiting time will be finite, due to the imposed limit on how long a task can stay alive. Also, in practice we will never observe a task with "infinite" waiting time.

What the analysis shows is that the average waiting time will be increasing over time, essentially resulting in a system with very unpredictable waiting times.

I should also say that before doing some basic survival analysis of the tasks, it is hard to tell what exactly causes the long waiting times.

Bob Carpenter said...

Having now run the same kind of job (morphological stemming in English) on three different data sizes, with and without qualifying tests, we've found that 1K words, 5K words, and 20K words all took about the same time -- a week or so, to complete. We're kicking off a 40K words job soon, and I'll report back on how long that takes. (Maybe I'll generate some nifty Brendan O'Connor-style R plots.)

One issue is how Amazon lists jobs -- the ones with large numbers of HITS get listed first. There might be people who just don't do a task if there aren't enough instances of it.

We added a qualifying test for the 5K and 20K word instances, and found that it didn't slow down the job, but it did reduce the overall number of people working on it. If there's no test, lots of people will do one or two HITs and then give up. With a pre-test, we had 27 Turkers for the 5K-word job and 60 Turkers for the 20K-word job.

Post a Comment