Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly improved clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder, followed by re-ranking with a cross-encoder trained with a contrastive-learning strategy tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforms multilingual language models designed for the same purpose. This holds even in complex scenarios involving heterogeneous medical terminologies and when the models are trained on a subset of the original data. Our results, evaluated using top-25 accuracy and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value both for large-scale automatic generation of structured data derived from clinical records and for exhaustive extraction and harmonization of predefined clinical variables of interest.
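The retrieve-then-re-rank pipeline and the top-k accuracy metric described in the abstract can be illustrated with a minimal sketch. It is not the ClinLinker implementation: the two-dimensional vectors stand in for bi-encoder embeddings, the token-overlap scorer stands in for a trained cross-encoder, and the concept codes (`C1`, `C2`, `C3`), their names, and all function names are hypothetical placeholders chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(mention_vec: Vector, concept_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Phase 1 (bi-encoder style): rank concepts by embedding similarity, keep k."""
    ranked = sorted(
        concept_vecs,
        key=lambda code: cosine(mention_vec, concept_vecs[code]),
        reverse=True,
    )
    return ranked[:k]


def rerank(mention: str, candidates: List[str],
           score_pair: Callable[[str, str], float]) -> List[str]:
    """Phase 2 (cross-encoder style): rescore each (mention, candidate) pair jointly."""
    return sorted(candidates, key=lambda code: score_pair(mention, code), reverse=True)


def top_k_accuracy(predictions: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of mentions whose gold code appears among the first k predictions."""
    hits = sum(1 for ranked, g in zip(predictions, gold) if g in ranked[:k])
    return hits / len(gold) if gold else 0.0


# Toy terminology: hypothetical codes, names, and hand-picked "embeddings".
CONCEPT_NAMES = {
    "C1": "diabetes mellitus",
    "C2": "diabetes insipidus",
    "C3": "fractura de femur",
}
CONCEPT_VECS = {"C1": (1.0, 0.0), "C2": (0.8, 0.6), "C3": (0.0, 1.0)}


def overlap_score(mention: str, code: str) -> float:
    """Assumed stand-in for a cross-encoder: count shared lowercase tokens."""
    return float(len(set(mention.lower().split()) & set(CONCEPT_NAMES[code].split())))


# Two toy mentions with their (assumed) embeddings and gold codes.
MENTIONS = [("diabetes insipidus", (1.0, 0.2), "C2"),
            ("fractura femur", (0.1, 1.0), "C3")]

retrieved = [retrieve(vec, CONCEPT_VECS, k=2) for _, vec, _ in MENTIONS]
reranked = [rerank(text, cands, overlap_score)
            for (text, _, _), cands in zip(MENTIONS, retrieved)]
gold_codes = [g for _, _, g in MENTIONS]

print("top-1 after retrieval:", top_k_accuracy(retrieved, gold_codes, k=1))
print("top-1 after re-ranking:", top_k_accuracy(reranked, gold_codes, k=1))
```

In this toy setup the first phase only needs to place the gold concept somewhere in the candidate list, while the second phase, which sees the mention and the candidate together, is responsible for moving it to the top. This is why retrieval quality is usually reported at larger k (such as 25) and final linking quality at small k.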