The CoNLL-2003 English named entity recognition (NER) dataset has been widely used to train and evaluate NER models for almost 20 years. However, it is unclear how well models that are trained on this 20-year-old data and developed over a period of decades using the same test set will perform when applied to modern data. In this paper, we evaluate the generalization of over 20 different models trained on CoNLL-2003 and show that NER models differ widely in how well they generalize. Surprisingly, we find no evidence of performance degradation in pre-trained Transformers, such as RoBERTa and T5, even when they are fine-tuned on decades-old data. We investigate why some models generalize well to new data while others do not, and attempt to disentangle the effects of temporal drift and overfitting due to test reuse. Our analysis suggests that most deterioration is due to temporal mismatch between the pre-training corpora and the downstream test sets. We find that four factors are important for good generalization: model architecture, number of parameters, the time period of the pre-training corpus, and the amount of fine-tuning data. We suggest that current evaluation methods have, in some sense, underestimated progress on NER over the past 20 years, as NER models have not only improved on the original CoNLL-2003 test set, but improved even more on modern data. Our datasets can be found at https://github.com/ShuhengL/acl2023_conllpp.