Researchers have raised awareness of the harms of aggregating labels, especially in subjective tasks that naturally contain disagreement among human annotators. In this work, we show that models given only aggregated labels exhibit low confidence on high-disagreement data instances. While previous studies treat such instances as mislabeled, we argue that high-disagreement text instances are hard to learn because conventional models trained on aggregated labels underperform at extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classification using Multiple Ground Truth (Multi-GT) approaches. Our experiments show improved confidence on the high-disagreement instances.