Discovering entity mentions in text that fall outside a Knowledge Base (KB) plays a critical role in KB maintenance, but has not yet been fully explored. Current methods are mostly limited to simple threshold-based approaches and feature-based classification, and datasets for evaluation are scarce. In this work, we propose BLINKout, a new BERT-based Entity Linking (EL) method that identifies mentions without a corresponding KB entity by matching them to a special NIL entity. To this end, we integrate novel techniques including NIL representation, NIL classification, and synonym enhancement. We also propose Ontology Pruning and Versioning strategies to construct out-of-KB mentions from normal, in-KB EL datasets. Results on four datasets of clinical notes and publications show that BLINKout outperforms existing methods in detecting out-of-KB mentions for the medical ontologies UMLS and SNOMED CT.
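The ideas named in the abstract can be illustrated with a minimal sketch. This is not the BLINKout implementation: the function names, the `NIL` label string, the entity identifiers, and the scores are all hypothetical, and real candidate scores would come from a BERT-based linker rather than a hand-written dictionary. The sketch shows (1) how pruning entities from a KB could turn in-KB mentions into out-of-KB ones, (2) the threshold-based baseline, and (3) the alternative of letting a NIL entity compete as an ordinary candidate.

```python
from typing import Dict, List, Set, Tuple

NIL = "NIL"  # hypothetical label for the special out-of-KB entity


def prune_dataset(
    kb: Set[str],
    mentions: List[Tuple[str, str]],
    pruned: Set[str],
) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Remove `pruned` entities from the KB and relabel their mentions as NIL."""
    reduced_kb = kb - pruned
    relabeled = [
        (text, entity if entity in reduced_kb else NIL)
        for text, entity in mentions
    ]
    return reduced_kb, relabeled


def link_with_threshold(scores: Dict[str, float], threshold: float) -> str:
    """Threshold baseline: return the top entity, or NIL if its score is too low."""
    if not scores:
        return NIL
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else NIL


def link_with_nil_candidate(scores: Dict[str, float], nil_score: float) -> str:
    """NIL-as-entity: NIL competes as an ordinary candidate with its own score."""
    candidates = dict(scores)
    candidates[NIL] = nil_score
    return max(candidates, key=candidates.get)


# Made-up identifiers and mentions, for illustration only.
kb = {"C001", "C002", "C003"}
mentions = [("heart attack", "C001"), ("fever", "C002"), ("cough", "C003")]
reduced_kb, relabeled = prune_dataset(kb, mentions, pruned={"C002"})
```

In this toy run, the mention "fever" becomes a NIL-labeled example because its entity was pruned, while the other two mentions keep their original labels. The pruning strategy described in the paper may involve further steps (for example, handling of related ontology concepts) that this sketch does not attempt to reproduce.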