State-of-the-art deep learning methods for entity linking rely on extensive human-labelled data, which is costly to acquire. Current datasets are limited in size, leading to inadequate coverage of biomedical concepts and diminished performance on new data. In this work, we propose to automatically generate large-scale training datasets, which makes it possible to apply approaches originally developed for extreme multi-label ranking to biomedical entity linking. We propose X-Linker, a hybrid pipeline whose modules link disease and chemical entity mentions to concepts in the MEDIC and CTD-Chemical vocabularies, respectively. X-Linker was evaluated on six biomedical datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical, BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969, 0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker achieved the best performance on three datasets (BC5CDR-Disease, NCBI-Disease, and BioRED-Chemical), whereas SapBERT outperformed X-Linker on the remaining three. Both models rely only on the mention string. The source code of X-Linker and its associated data are publicly available and can be used to perform biomedical entity linking without requiring entities pre-labelled with identifiers from specific knowledge organization systems.
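
For readers unfamiliar with the metric reported above, top-1 accuracy is the fraction of mentions whose highest-ranked candidate concept matches the gold identifier. The following is a minimal sketch under stated assumptions: each prediction is taken to be a ranked list of concept identifiers, and the function name `top1_accuracy` and the toy identifiers are illustrative, not taken from the X-Linker code base.

```python
from typing import Sequence


def top1_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
) -> float:
    """Fraction of mentions whose first-ranked candidate equals the gold ID.

    Hypothetical helper for illustration; not part of X-Linker.
    """
    if len(ranked_predictions) != len(gold_ids):
        raise ValueError("predictions and gold labels must have equal length")
    if not gold_ids:
        return 0.0
    # A mention with an empty candidate list counts as incorrect.
    correct = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_ids)
        if ranked and ranked[0] == gold
    )
    return correct / len(gold_ids)


# Toy data (assumed for illustration): three mentions, each with ranked candidates.
ranked = [
    ["MESH:D003920", "MESH:D003924"],  # top candidate matches gold
    ["MESH:D006973"],                  # top candidate matches gold
    ["MESH:D001249", "MESH:D012130"],  # gold is ranked second, so a miss
]
gold = ["MESH:D003920", "MESH:D006973", "MESH:D012130"]

score = top1_accuracy(ranked, gold)
print(f"top-1 accuracy: {score:.4f}")
```

In this toy case two of the three mentions are linked correctly at rank one, so the third mention counts as an error even though its gold concept appears among the candidates.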