Entity linking models have achieved significant success by using pretrained language models to capture semantic features. However, the NIL prediction problem, which aims to identify mentions that have no corresponding entity in the knowledge base, has received insufficient attention. We categorize mentions linking to NIL into two types, Missing Entity and Non-Entity Phrase, and propose NEL, an entity linking dataset that focuses on the NIL prediction problem. NEL takes ambiguous entities as seeds, collects relevant mention contexts from the Wikipedia corpus, and ensures the presence of mentions linking to NIL through human annotation and entity masking. We conduct a series of experiments with the widely used bi-encoder and cross-encoder entity linking models. The results show that both types of NIL mentions in the training data have a significant influence on the accuracy of NIL prediction. Our code and dataset are available at https://github.com/solitaryzero/NIL_EL
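The entity masking step mentioned above can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so this is only one plausible reading: a subset of entities is removed from the knowledge base, and every mention whose gold entity was removed is relabeled as NIL. The function name `mask_entities`, the `mask_ratio` parameter, and the dictionary and tuple data layout are all assumptions made for illustration and are not taken from the NEL code base.

```python
import random
from typing import Dict, List, Set, Tuple

NIL = "NIL"  # hypothetical label for mentions without a KB entity


def mask_entities(
    kb: Dict[str, str],
    mentions: List[Tuple[str, str]],
    mask_ratio: float = 0.5,
    seed: int = 0,
) -> Tuple[Dict[str, str], List[Tuple[str, str]], Set[str]]:
    """Drop a random subset of entities from the KB and relabel their mentions.

    kb maps entity id -> description; mentions are (context, gold entity id)
    pairs. Both layouts are illustrative assumptions.
    """
    rng = random.Random(seed)
    entity_ids = sorted(kb)  # sort so the sample is reproducible for a given seed
    n_masked = int(len(entity_ids) * mask_ratio)
    masked = set(rng.sample(entity_ids, n_masked))

    # The reduced KB no longer contains the masked entities.
    reduced_kb = {eid: desc for eid, desc in kb.items() if eid not in masked}

    # Mentions of masked entities now have no KB target, so they become NIL.
    relabeled = [(ctx, NIL if gold in masked else gold) for ctx, gold in mentions]
    return reduced_kb, relabeled, masked


if __name__ == "__main__":
    kb = {
        "Q1": "Apple (fruit)",
        "Q2": "Apple Inc.",
        "Q3": "Apple Records",
        "Q4": "Apple (album)",
    }
    mentions = [
        ("She ate an apple.", "Q1"),
        ("Apple released a new phone.", "Q2"),
        ("The band signed with Apple.", "Q3"),
        ("Apple was her debut album.", "Q4"),
    ]
    reduced_kb, relabeled, masked = mask_entities(kb, mentions)
    for context, label in relabeled:
        print(f"{label}\t{context}")
```

Under this reading, masking would only produce NIL mentions whose entity is absent from the knowledge base, while phrases that do not refer to an entity at all would have to come from the human annotation step.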