How best to reduce the measurement variability and bias that subjectivity introduces into crowdsourced labelling remains an open question. We introduce a theoretical framework for understanding how random error and measurement bias enter crowdsourced annotations of subjective constructs. We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error. To assess the performance of the labelling approaches, we construct an agent-based model of crowdsourced labelling that lets us introduce different types of subjectivity into the tasks. We find that under most conditions with task subjectivity, the comparison approach produces higher $F_1$ scores. Further, the comparison approach is less susceptible to inflating bias than majority voting, which tends to amplify it. To facilitate applications, we show with simulated and real-world data that the number of random comparisons required to reach a given classification accuracy scales log-linearly, as $O(N \log N)$, with the number of labelled items $N$. We also implement the Elo system as an open-source Python package.
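To make the pairwise-comparison-plus-Elo idea concrete, the following is a minimal sketch and not the interface of the open-source package mentioned above. It assumes the conventional Elo parameters (initial rating 1000, scale 400, K-factor 32), none of which are stated in the abstract. The helper names `expected_score`, `elo_update`, and `rank_by_comparisons` are hypothetical, as is the `budget_factor` constant in front of $N \log N$. The simulated annotator is a simple Gaussian-noise stand-in, not the agent-based model described above.

```python
import math
import random


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of item A against item B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison; the update is zero-sum."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rank_by_comparisons(latent, noise=0.1, k=32.0, budget_factor=4.0, seed=0):
    """Score items from random pairwise comparisons.

    `latent` holds each item's true value on the subjective construct
    (known only in simulation). The number of comparisons is set to
    budget_factor * N * ln(N); the constant is an assumption.
    """
    rng = random.Random(seed)
    n = len(latent)
    ratings = [1000.0] * n  # assumed common starting rating
    n_comparisons = int(budget_factor * n * math.log(n))
    for _ in range(n_comparisons):
        i, j = rng.sample(range(n), 2)  # two distinct items, chosen at random
        # Hypothetical annotator: perceives each item with Gaussian noise
        # and picks the one that appears to have more of the construct.
        a_wins = latent[i] + rng.gauss(0, noise) > latent[j] + rng.gauss(0, noise)
        ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], a_wins, k)
    return ratings


if __name__ == "__main__":
    latent = [i / 49 for i in range(50)]
    ratings = rank_by_comparisons(latent)
    # Items can then be classified by thresholding the ratings,
    # for example at the median, instead of by majority vote.
    threshold = sorted(ratings)[len(ratings) // 2]
    labels = [r >= threshold for r in ratings]
    print(sum(labels), "items labelled positive")
```

Because each comparison only asks which of two items shows more of the construct, an annotator's personal threshold for what counts as a positive label never enters the rating update. That is one intuition for why a comparison-based pipeline could be less prone to inflating bias than majority voting.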