We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results on three widely used benchmarks covering three languages: MedMentions (English), QUAERO (French), and SPACCC (Spanish). In a data-efficiency evaluation, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, because standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research on HuggingFace (datasets and models) and in a GitHub repository.