BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation

Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). Popular approaches to the task are name-based methods, i.e. those that identify the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, because these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn model (Sung et al., 2020), introducing two crucial extensions. First, it preprocesses the KB, expanding each homonym with an automatically chosen disambiguating string and thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy for selecting candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55 pp in recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and can thus also improve other methods, which we exemplify for GenBioEL (Yuan et al., 2022), a generative name-based BEL approach. Code is available at: link added upon publication.
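
To make the homonym problem and the KB preprocessing concrete, the following is a minimal sketch, not the BELHD implementation. It assumes a toy KB in which every entity already comes with a candidate disambiguating string (here a species); how BELHD chooses that string automatically is not described in the abstract. The function name `expand_homonyms`, the identifiers, and the "name (disambiguator)" format are all hypothetical.

```python
from collections import defaultdict


def expand_homonyms(kb):
    """Return entity_id -> name, with names shared by several entities expanded.

    kb maps entity_id -> (name, disambiguator). The disambiguator is an
    assumed, precomputed string (e.g. a species); it is illustrative only.
    """
    # Group entity ids by their exact name to detect homonyms.
    ids_by_name = defaultdict(list)
    for entity_id, (name, _) in kb.items():
        ids_by_name[name].append(entity_id)

    expanded = {}
    for entity_id, (name, disambiguator) in kb.items():
        if len(ids_by_name[name]) > 1:
            # Homonym: append the disambiguating string to the name.
            expanded[entity_id] = f"{name} ({disambiguator})"
        else:
            # Unambiguous name: keep it unchanged.
            expanded[entity_id] = name
    return expanded


# Hypothetical toy KB: two entities share the exact name "TP53".
toy_kb = {
    "gene:1": ("TP53", "human"),
    "gene:2": ("TP53", "mouse"),
    "gene:3": ("BRCA1", "human"),
}
unique_names = expand_homonyms(toy_kb)
```

After this step every name in the toy KB identifies exactly one entity, so a name-based linker that returns a name also returns a unique entity. The sketch presumes that the supplied disambiguators differ among entities sharing a name.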
