Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, corresponding to the traditional and human-label-variation interpretations of error, respectively, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.