Named Entity Recognition (NER) is a key information extraction task with a long-standing tradition. While recent studies aim to correct annotation errors through re-labeling efforts, little is known about the sources of human label variation, such as text ambiguity, annotation error, or guideline divergence. This is especially true for high-quality datasets and for datasets beyond English CoNLL03. This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian. We show that text ambiguity and artificial guideline changes are the dominant factors behind diverging annotations among high-quality revisions. We survey student annotations on a subset of difficult entities and substantiate the feasibility and necessity of multiple annotations for understanding named entity ambiguities from a distributional perspective.