
Crowd-Sourcing Human Ratings of Linguistic Production

Abstract

This study examines the reliability and validity of two types of crowd-sourced judgments for collecting lexical diversity scores. Scaled and pairwise-comparison approaches were used to collect data from non-expert Amazon Mechanical Turk workers. The reliability of the lexical diversity ratings from the crowd-sourced raters was assessed alongside that of trained raters using a variety of reliability statistics. The validity of the ratings was examined by 1) comparing crowd-sourced and trained ratings, 2) comparing crowd-sourced and trained ratings to ratings of language proficiency, and 3) using an objective measure of lexical diversity to predict the crowd-sourced and trained ratings. The results indicate that the scaled crowd-sourced ratings showed strong reliability in terms of text and rater strata and yielded fewer misfitting texts than the trained ratings. The scaled crowd-sourced ratings were also strongly predicted by lexical diversity features derived from the texts themselves.
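
The study's validity check predicts human ratings from an objective measure of lexical diversity. As a rough illustration only, and not the study's code, the Python sketch below assumes a moving-average type-token ratio (MATTR) as the objective measure and correlates it with hypothetical mean crowd-sourced ratings; all texts, values, and names are invented for demonstration.

"""Illustrative sketch (not the study's code): computing an objective
lexical diversity measure and relating it to mean crowd-sourced ratings.
The measure shown is a moving-average type-token ratio (MATTR); the texts
and ratings below are hypothetical."""

from statistics import correlation, mean


def mattr(text: str, window: int = 50) -> float:
    """Moving-average type-token ratio: the mean TTR over fixed-size
    windows, which reduces plain TTR's sensitivity to text length."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    return mean(
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    )


# Hypothetical inputs: each text paired with the mean rating assigned by
# its crowd-sourced judges on the scaled lexical diversity instrument.
texts = [
    "the analyst reviewed several distinct transcripts before scoring each response",
    "the the the test test was was easy easy easy and and short short",
    "people people often write simple simple stories about recent trips",
]
mean_ratings = [4.1, 1.7, 3.2]

diversity = [mattr(t) for t in texts]

# Pearson correlation between the objective measure and the human ratings
# (statistics.correlation requires Python 3.10+).
print(round(correlation(diversity, mean_ratings), 3))

MATTR is used here only because plain TTR shrinks as texts get longer; any length-corrected diversity index could be substituted in the same way.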
