Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as 'Person' or 'Medicine') without having seen any training examples for those types. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even identical to) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. We further argue that current evaluation setups provide an incomplete picture of zero-shot abilities, since they do not quantify the label shift (i.e., how similar the evaluation labels are to the training labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation and their frequency in the training data, yielding an estimate of label shift. Familiarity allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. It also enables researchers to generate evaluation setups of varying transfer difficulty for fine-grained analysis of zero-shot NER.
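To make the idea concrete, the following is a minimal sketch of a Familiarity-style score, not the paper's exact formula: it assumes entity-type labels are embedded with a sentence-transformers model (the model name, the log-scaled frequency weighting, and the `familiarity` helper are all illustrative choices). Each evaluation type is matched to its most semantically similar training type by cosine similarity, that similarity is damped by how frequent the matched type is in the training data, and the result is averaged over the evaluation label set.

```python
# Hypothetical sketch of a Familiarity-style label-shift estimate.
# Assumptions: sentence-transformers is installed; the weighting scheme
# below is illustrative and may differ from the paper's definition.
from collections import Counter

import numpy as np
from sentence_transformers import SentenceTransformer


def familiarity(train_labels, eval_labels, model_name="all-MiniLM-L6-v2"):
    """Estimate label shift between training and evaluation label sets.

    Returns a score in roughly [0, 1]: high values indicate evaluation
    entity types that are semantically close to frequent training types,
    low values indicate genuinely unseen types.
    """
    model = SentenceTransformer(model_name)
    counts = Counter(train_labels)          # frequency of each training type
    train_types = sorted(counts)
    eval_types = sorted(set(eval_labels))

    # Unit-normalized embeddings, so a dot product equals cosine similarity.
    t_emb = model.encode(train_types, normalize_embeddings=True)
    e_emb = model.encode(eval_types, normalize_embeddings=True)
    sims = e_emb @ t_emb.T                  # (|eval| x |train|) similarity matrix

    # Saturating frequency weight in (0, 1]: rare training types count less.
    max_count = max(counts.values())
    weights = np.array(
        [np.log1p(counts[t]) / np.log1p(max_count) for t in train_types]
    )

    # Best frequency-weighted match per evaluation type, averaged.
    return float(np.mean((sims * weights).max(axis=1)))


# Example: 'medicine' is close to the training type 'drug', so the score
# is well above zero despite no exact label match.
print(familiarity(["person", "person", "drug"], ["medicine", "location"]))
```

Read this way, a score near 1 signals heavy label overlap between training and evaluation (an easy transfer), while a score near 0 signals evaluation types that are both semantically distant from, and rare among, the training labels.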