We commonly use agreement measures to assess the utility of judgements made by human annotators in Natural Language Processing (NLP) tasks. While inter-annotator agreement is frequently used as an indication of label reliability by measuring consistency between annotators, we argue for the additional use of intra-annotator agreement to measure label stability (and annotator consistency) over time. However, in a systematic review, we find that the latter is rarely reported in this field. Calculating these measures can serve as an important quality-control step and could provide insights into why annotators disagree. We conduct exploratory annotation experiments to investigate the relationships between these measures and perceptions of subjectivity and ambiguity in text items, finding that annotators provide inconsistent responses around 25% of the time across four different NLP tasks.
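To make the distinction between the two measures concrete, the minimal sketch below computes inter-annotator agreement (Cohen's kappa between two annotators on the same items) and intra-annotator agreement (kappa between one annotator's labels from two annotation rounds), along with the raw proportion of items on which the annotator changed their response. The labels and the use of scikit-learn's `cohen_kappa_score` are illustrative assumptions, not the paper's actual annotation setup or analysis pipeline.

```python
# Minimal sketch (not the paper's pipeline): contrasting inter- and
# intra-annotator agreement with Cohen's kappa on hypothetical labels.
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels for ten items.
annotator_a_round1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b_round1 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # second annotator, same items
annotator_a_round2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # annotator A re-labelling later

# Inter-annotator agreement: consistency between annotators (label reliability).
inter_kappa = cohen_kappa_score(annotator_a_round1, annotator_b_round1)

# Intra-annotator agreement: stability of one annotator's labels over time.
intra_kappa = cohen_kappa_score(annotator_a_round1, annotator_a_round2)

# Raw proportion of items on which annotator A changed their response.
changed = sum(x != y for x, y in zip(annotator_a_round1, annotator_a_round2))
inconsistency_rate = changed / len(annotator_a_round1)

print(f"Inter-annotator kappa: {inter_kappa:.2f}")
print(f"Intra-annotator kappa: {intra_kappa:.2f}")
print(f"Intra-annotator inconsistency rate: {inconsistency_rate:.0%}")
```

Kappa corrects raw agreement for chance, so the two kappa values are directly comparable; the raw inconsistency rate corresponds to the kind of "inconsistent responses" statistic reported in the abstract.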