Modern named entity recognition (NER) systems have steadily improved in the age of larger and more powerful neural models. However, over the past several years, the state of the art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.