Discovering entity mentions in text that fall outside a Knowledge Base (KB) plays a critical role in KB maintenance, but has not yet been fully explored. Current methods are mostly limited to simple threshold-based approaches and feature-based classification, and datasets for evaluation are scarce. We propose BLINKout, a new BERT-based Entity Linking (EL) method that identifies mentions without a corresponding KB entity by matching them to a special NIL entity. To make better use of BERT, we propose new techniques including NIL entity representation and classification, with synonym enhancement. We also apply KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB EL datasets. Results on five datasets of clinical notes, biomedical publications, and Wikipedia articles in various domains show the advantages of BLINKout over existing methods in identifying out-of-KB mentions for the medical ontologies UMLS and SNOMED CT and for the general KB WikiData.
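The KB Pruning idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that pruning means removing a sample of linked entities from the KB and relabelling their mentions as NIL; the abstract does not spell out the actual sampling procedure, so the `Mention` type, the `prune_kb` function, and the `prune_ratio` and `seed` parameters are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass

NIL = "NIL"  # label for mentions whose entity is no longer in the KB


@dataclass(frozen=True)
class Mention:
    text: str
    entity_id: str  # gold KB entity id, or NIL


def prune_kb(kb, mentions, prune_ratio=0.1, seed=0):
    """Drop a random share of the linked entities from `kb` and
    relabel every mention of a dropped entity as NIL.

    Assumption: uniform random sampling over linked entities; the
    strategy used in the paper may differ.
    """
    rng = random.Random(seed)
    # Only entities that are actually mentioned can yield out-of-KB mentions.
    linked = sorted({m.entity_id for m in mentions if m.entity_id in kb})
    n_remove = max(1, round(len(linked) * prune_ratio)) if linked else 0
    removed = set(rng.sample(linked, n_remove))

    pruned_kb = {eid: name for eid, name in kb.items() if eid not in removed}
    relabeled = [
        Mention(m.text, NIL if m.entity_id in removed else m.entity_id)
        for m in mentions
    ]
    return pruned_kb, relabeled, removed


if __name__ == "__main__":
    # Toy KB and mentions, invented for this example.
    kb = {"C1": "asthma", "C2": "diabetes", "C3": "migraine", "C4": "anemia"}
    mentions = [
        Mention("asthma attack", "C1"),
        Mention("type 2 diabetes", "C2"),
        Mention("migraine with aura", "C3"),
        Mention("iron deficiency anemia", "C4"),
    ]
    pruned_kb, relabeled, removed = prune_kb(kb, mentions, prune_ratio=0.5)
    print("removed entities:", sorted(removed))
    for m in relabeled:
        print(m.text, "->", m.entity_id)
```

The in-KB mentions keep their original labels, so the same output can be used to evaluate both in-KB linking and out-of-KB detection against the pruned KB.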