By Arthur Stone
This post discusses our recent finding that MTurk survey respondents have much lower levels of life satisfaction than other representative samples, even after controlling for demographic differences among samples. We urge caution in using MTurk samples for studies where lower life satisfaction could impact results.
The advent of Amazon’s Mechanical Turk program (MTurk) has been a boon for social and behavioral science researchers seeking to conveniently and inexpensively secure an adult sample of respondents for testing. Though the MTurk system was developed to provide manpower for repetitive tasks (e.g., checking spreadsheet figures), researchers quickly realized that this pool of individuals could be used for completing all manner of questionnaires. MTurk participants are readily available in large numbers and relatively inexpensive, as they are paid small amounts per task, and survey results are quickly returned, often in a matter of hours or days. What could be better?
Of course, individuals who sign up for MTurk tasks are unlikely to be a representative sample of the United States or, for that matter, of any other geographic region. There are many obvious concerns about the sample: the monetary incentives for the tasks, though small, may disproportionately attract those with low incomes; participants may rush through tasks to maximize income, producing high error rates; they may be more likely to be unemployed; they may be prone to depression and anxiety; and, given the technological nature of the system, they may skew young and tech-savvy. On the other hand, some research has documented qualities that support their appropriateness for social and behavioral science research, such as a high level of attentiveness to tasks.
The new findings summarized here come from a study conducted by scientists at CESR’s Dornsife Center for Self-Report Science and published in 2019 (Stone, Walentynowicz, Schneider, Junghaenel, & Wen, Computers in Human Behavior). For this study (which was on order effects in survey questions) we recruited 4,500 MTurk participants to complete a survey. One of the questions in the survey was a measure of life satisfaction: Cantril’s Self-Anchoring Scale, often referred to as the Cantril Ladder. To answer it, participants read a couple of sentences asking them to imagine a ladder with steps numbered from 0, at the bottom, to 10, at the top. The top of the ladder represents “the best life possible for you” and the bottom of the ladder represents “the worst life possible for you.” We analyzed responses from the 933 participants who completed the Cantril Ladder question first (so that their answers would not be influenced by preceding questions).
Our prior knowledge of the Cantril Ladder was based on large-scale population work, and the general consensus was that people were very satisfied with their lives – producing an average rating of around 7. We were, then, quite surprised to find that the MTurk sample in our new study had a mean of 5.3. At this point, we needed to bring in Cantril Ladder data from other surveys to determine whether the low satisfaction score in the MTurk sample was due to its particular demographic composition. For example, if younger people typically have low Cantril scores and the MTurk sample was predominantly younger, that could account for the low MTurk mean. We used two additional samples to conduct statistical analyses that would determine whether the MTurk sample’s characteristics were the culprit. One was 221 individuals from the University of Southern California’s Understanding America Study, an Internet panel of approximately 7,000 respondents representing the entire United States. The other was 137,185 respondents from the Gallup Organization’s daily poll, a nationally representative telephone survey conducted over the last several years. We found that the UAS panel had a mean Cantril score of 6.9 and the Gallup poll a mean of 7.1 – both much higher than the MTurk sample’s. Surprisingly, analyses of Cantril scores in the three samples indicated that demographic characteristics did not explain the differences in life satisfaction.
Another finding emerged from the analyses of the three samples: generally speaking, the differences in Cantril scores across the demographic variables we examined were similar from sample to sample. For instance, the difference between men’s and women’s Cantril Ladder scores was equivalent in each sample (though the means were lower in the MTurk sample). One exception was that income interacted with sample: the gap in Cantril scores between the MTurk sample and the other samples was most pronounced for those with lower incomes.
Not quite satisfied yet, we searched the existing literature for instances in which a study administered the Cantril Ladder to both an MTurk sample and another, more representative sample. We found one such study published in 2017 (Whillans et al.): it included 366 MTurkers, 1,260 individuals from an Internet panel, and a Qualtrics sample of 1,802. Because the authors had no explicit interest in the Cantril scores for the samples, they did not present the means. We reanalyzed their raw data, and the results confirmed our prior analyses: the MTurk sample mean was 5.3, the Internet panel’s 6.8, and the Qualtrics sample’s 6.8. These results showed that our initial finding was not a fluke limited, perhaps, to the way we presented the study to participants (as a well-being study) or to other factors.
What do we conclude from these results? First, let’s put the Cantril score difference into context: in the full Gallup daily poll sample, prior work has shown a .93 point/rung decrement for those who report being disabled (obviously a very serious condition) versus all others. The gap we observed for the MTurk sample – 1.6 to 1.8 rungs relative to the UAS and Gallup samples – is considerably larger than that. So, MTurk participants have considerably lower levels of life satisfaction than others. At this point, we can only speculate about why. Perhaps the demographic variables available for the analyses were inadequate in that they did not capture group differences pertinent to life satisfaction. Or perhaps MTurk participants view the tasks as part of their work day, whereas respondents in the Internet panel and Gallup samples did not view the surveys this way. This could matter because being “at work” is typically associated with unhappiness, which could contaminate the life satisfaction ratings.
Finally, we consider whether these results suggest that the use of MTurk samples could bias social and behavioral research. Maybe. The consistency of the demographic associations across the three samples may suggest that the lower life satisfaction levels will not bias associations estimated in the MTurk group relative to the other samples. But that is a big assumption, and we should not forget the observed interaction with income. On the other hand, if MTurk participants truly do have low life satisfaction and poor subjective well-being, then that could affect studies where well-being and mood could influence results. For example, poor well-being may affect cognitive processing tasks or studies that involve the recall of past events. Thus, we recommend caution in using MTurk samples when one is interested in findings that generalize to wide swaths of the population.
In summary, MTurk samples have become a staple of social and behavioral researchers and it behooves us to know as much as possible about these folks.