Because high-quality data is like oxygen for AI systems, effectively eliciting information from crowdsourcing workers has become a first-order problem for developing high-performance machine learning algorithms. Two prevalent paradigms, spot-checking and peer prediction, enable the design of mechanisms that evaluate and incentivize high-quality data from human labelers. So far, at least three metrics have been proposed to compare the performance of these techniques [33, 8, 3]. However, different metrics lead to divergent and even contradictory results in various contexts. In this paper, we harmonize these divergent stories, showing that two of these metrics are actually the same within certain contexts and explaining the divergence of the third. Moreover, we unify these different contexts by introducing \textit{Spot Check Equivalence}, which offers an interpretable metric for the effectiveness of a peer prediction mechanism. Finally, we present two approaches to compute spot check equivalence in various contexts, and simulation results verify the effectiveness of our proposed metric.