Many existing approaches for learning from labeled data assume the existence of gold-standard labels. Under these approaches, inter-annotator disagreement is treated as noise to be removed through refinement of annotation guidelines, label adjudication, or label filtering. However, annotator disagreement can rarely be eliminated entirely, especially on more subjective tasks such as sentiment analysis or hate speech detection, where disagreement is natural. A newer approach to learning from labeled data, called data perspectivism, therefore seeks to leverage inter-annotator disagreement to learn models that stay true to the inherent uncertainty of the task by treating annotations as opinions of the annotators rather than gold-standard facts. Despite this conceptual grounding, existing methods under data perspectivism use disagreement as the sole source of annotation uncertainty. To expand the possibilities of data perspectivism, we introduce Subjective Logic Encodings (SLEs), a flexible framework for constructing classification targets that explicitly encodes annotations as opinions of the annotators. Based on Subjective Logic Theory, SLEs encode labels as Dirichlet distributions and provide principled methods for encoding and aggregating various types of annotation uncertainty -- annotator confidence, reliability, and disagreement -- into the targets. We show that SLEs generalize other types of label encodings, and we show how to estimate models that predict SLEs using a distribution matching objective.
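To make the idea of encoding annotations as a Dirichlet target concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard Subjective Logic mapping from evidence to Dirichlet parameters, alpha_k = r_k + a_k * W, with a uniform base rate a_k = 1/K and a prior weight W = 2. The function names, the (label, confidence, reliability) input format, and the choice to weight each vote by confidence times reliability are hypothetical simplifications introduced here for illustration.

```python
from typing import List, Sequence, Tuple

# W: non-informative prior weight. Assumption: W = 2, a common Subjective Logic default.
PRIOR_WEIGHT = 2.0


def encode_annotations(
    annotations: Sequence[Tuple[int, float, float]],
    num_classes: int,
) -> List[float]:
    """Turn (label, confidence, reliability) triples into Dirichlet parameters."""
    base_rate = 1.0 / num_classes  # uniform base rate over classes (assumed)
    evidence = [0.0] * num_classes
    for label, confidence, reliability in annotations:
        # Assumed weighting: each vote is discounted by confidence and reliability.
        # Summing evidence across annotators is how disagreement enters the target.
        evidence[label] += confidence * reliability
    # Evidence-to-Dirichlet mapping: alpha_k = r_k + a_k * W.
    return [r + base_rate * PRIOR_WEIGHT for r in evidence]


def to_opinion(alpha: Sequence[float]) -> Tuple[List[float], float]:
    """Recover per-class belief masses and the uncertainty mass from alpha."""
    total = sum(alpha)  # equals (total evidence) + W
    base_rate = 1.0 / len(alpha)
    belief = [(a - base_rate * PRIOR_WEIGHT) / total for a in alpha]
    uncertainty = PRIOR_WEIGHT / total
    return belief, uncertainty


def expected_probabilities(alpha: Sequence[float]) -> List[float]:
    """Mean of the Dirichlet, i.e. the projected class probabilities."""
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    # Three annotators on a binary task: two vote for class 1 (one with low
    # confidence), one less reliable annotator votes for class 0.
    votes = [(1, 1.0, 1.0), (1, 0.5, 1.0), (0, 1.0, 0.5)]
    alpha = encode_annotations(votes, num_classes=2)
    print(alpha, to_opinion(alpha), expected_probabilities(alpha))
```

In this sketch the three votes yield alpha = [1.5, 2.5], belief masses [0.125, 0.375], and an uncertainty mass of 0.5. With no annotations the target reduces to a uniform Dirichlet, and as evidence accumulates the expected probabilities approach the weighted vote proportions, which loosely illustrates how such an encoding can subsume soft labels. A model could then be fit by matching its predicted Dirichlet to this target, for example with a KL divergence; the specific objective used by the authors is not given in the abstract.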