Few-shot named entity recognition (NER) detects named entities within text using only a few annotated examples. One promising line of research is to leverage natural language descriptions of each entity type: the common label PER might, for example, be verbalized as "person entity." In an initial label interpretation learning phase, the model learns to interpret such verbalized descriptions of entity types. In a subsequent few-shot tagset extension phase, this model is then given a description of a previously unseen entity type (such as "music album") and, optionally, a few training examples to perform few-shot NER for this type. In this paper, we systematically explore the impact of a strong semantic prior for interpreting verbalizations of new entity types by massively scaling up the number and granularity of entity types used for label interpretation learning. To this end, we leverage an entity linking benchmark to create a dataset with orders of magnitude more distinct entity types and descriptions than currently used datasets. We find that this increased signal yields strong results in zero- and few-shot NER across in-domain, cross-domain, and even cross-lingual settings. Our findings indicate significant potential for improving few-shot NER through heuristic, data-based optimization.
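To make the idea of label verbalization and tagset extension concrete, the following is a minimal sketch and not the method used in this work. The names `Tagset`, `toy_similarity`, and `best_label` are hypothetical, and the word-overlap score is only a stand-in for a model that has actually learned to interpret label descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Tagset:
    """Maps short labels (e.g. PER) to natural language descriptions."""
    descriptions: Dict[str, str] = field(default_factory=dict)

    def extend(self, label: str, description: str) -> None:
        # Tagset extension: add a previously unseen type via its description.
        self.descriptions[label] = description

    def verbalize(self, label: str) -> str:
        # Return the verbalized description for a label.
        return self.descriptions[label]


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumption: this is a placeholder for a learned scoring model.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_label(context: str, tagset: Tagset) -> str:
    # Pick the label whose description scores highest against the context.
    return max(
        tagset.descriptions,
        key=lambda label: toy_similarity(context, tagset.descriptions[label]),
    )


if __name__ == "__main__":
    tagset = Tagset({"PER": "person entity"})
    tagset.extend("ALBUM", "music album")  # new type, no retraining in this toy
    print(best_label("the band released a new music album", tagset))
```

In this sketch a new entity type is added purely through its description, mirroring the tagset extension phase described above. A real system would replace `toy_similarity` with a trained model and operate on token spans rather than whole sentences.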