Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve, and we may need the NER model to recognize additional entity types. A simple approach is to re-annotate the entire dataset with both the existing and the additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (by 0.5 to 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where limited data is available for the additional entity types (as much as 11 F1), thus suggesting a more cost-effective approach to taxonomy expansion.