Discovering entity mentions in text that fall outside a Knowledge Base (KB) plays a critical role in KB maintenance but has not yet been fully explored. Current methods are mostly limited to simple threshold-based approaches and feature-based classification, and datasets for evaluation are scarce. We propose BLINKout, a new BERT-based Entity Linking (EL) method that identifies mentions without a corresponding KB entity by matching them to a special NIL entity. To better utilize BERT, we propose new techniques including NIL entity representation and classification, with synonym enhancement. We also propose KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB EL datasets. Results on five datasets of clinical notes, biomedical publications, and Wikipedia articles in various domains show the advantages of BLINKout over existing methods in identifying out-of-KB mentions for the medical ontologies UMLS and SNOMED CT and for the general KB WikiData.
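To make the KB Pruning idea and the threshold-based baseline more concrete, the following is a minimal sketch and not the BLINKout implementation. It assumes that pruning means removing a random share of entities from the KB and relabelling every mention whose gold entity was removed as NIL. The names `prune_kb`, `toy_score`, and `link_with_threshold` are hypothetical, as are the `NIL` identifier and the `prune_ratio` parameter. The token-overlap score is only a stand-in for a learned BERT-based mention-entity score.

```python
import random
from typing import Dict, List, Tuple

NIL = "NIL"  # hypothetical identifier for the special out-of-KB entity


def prune_kb(
    kb: Dict[str, str],
    mentions: List[Tuple[str, str]],
    prune_ratio: float,
    seed: int = 0,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Drop a random share of KB entities and relabel affected mentions as NIL.

    kb maps entity id -> entity name; mentions are (mention text, gold entity id).
    This is an assumed reading of "KB Pruning", not the paper's exact procedure.
    """
    rng = random.Random(seed)
    entity_ids = sorted(kb)  # sorted so the sample is reproducible for a given seed
    n_removed = int(len(entity_ids) * prune_ratio)
    removed = set(rng.sample(entity_ids, n_removed))
    pruned_kb = {eid: name for eid, name in kb.items() if eid not in removed}
    relabelled = [
        (text, NIL if gold in removed else gold) for text, gold in mentions
    ]
    return pruned_kb, relabelled


def toy_score(mention: str, entity_name: str) -> float:
    """Jaccard token overlap, a toy stand-in for a learned similarity score."""
    a = set(mention.lower().split())
    b = set(entity_name.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_with_threshold(mention: str, kb: Dict[str, str], threshold: float) -> str:
    """Threshold-based baseline: predict NIL when the best score is too low."""
    if not kb:
        return NIL
    best_id = max(kb, key=lambda eid: toy_score(mention, kb[eid]))
    if toy_score(mention, kb[best_id]) >= threshold:
        return best_id
    return NIL
```

Under these assumptions, an existing in-KB dataset becomes an out-of-KB dataset without new annotation: the mentions stay the same, and only the KB and the gold labels change. The threshold baseline shows the kind of decision rule that a dedicated NIL entity is meant to replace.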