The biomedical field relies heavily on concept linking in areas such as literature mining, graph alignment, information retrieval, question answering, and data and knowledge integration. Although large language models (LLMs) have made significant strides in many natural language processing tasks, their effectiveness in biomedical concept mapping has yet to be fully explored. This research investigates a method that exploits the in-context learning (ICL) capabilities of large models for biomedical concept linking. The proposed approach adopts a two-stage retrieve-and-rank framework. First, biomedical concepts are embedded using language models, and embedding similarity is used to retrieve the top candidates. The contextual information of these candidates is then incorporated into the prompt and processed by a large language model to re-rank the concepts. This approach achieved an accuracy of 90% in BC5CDR disease entity normalization and 94.7% in chemical entity normalization, performance competitive with supervised learning methods. It also yielded an absolute improvement of over 20 points in F1 score on an oncology matching dataset. Extensive qualitative assessments were conducted, and the benefits and potential shortcomings of using large language models within the biomedical domain were discussed.
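The two-stage retrieve-and-rank framework can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the concept IDs and descriptions are invented, `embed` uses character trigram counts as a stand-in for a language-model embedding, the prompt template is hypothetical, and `llm` is any caller-supplied function that maps a prompt string to a reply string. Falling back to the top retrieved candidate when the reply is not a valid ID is likewise a design choice of this sketch.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical toy ontology: concept ID -> (name, short context description).
CONCEPTS: Dict[str, Tuple[str, str]] = {
    "C1": ("myocardial infarction", "necrosis of heart muscle caused by ischemia"),
    "C2": ("cerebral infarction", "brain tissue death caused by blocked blood supply"),
    "C3": ("hypertension", "persistently elevated arterial blood pressure"),
}


def embed(text: str, n: int = 3) -> Counter:
    """Stand-in embedding: character n-gram counts (not a language model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(mention: str, concepts: Dict[str, Tuple[str, str]], k: int = 2) -> List[str]:
    """Stage 1: return the IDs of the k concepts most similar to the mention."""
    query = embed(mention)
    ranked = sorted(
        concepts,
        key=lambda cid: cosine(query, embed(concepts[cid][0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(mention: str, candidates: List[str],
                 concepts: Dict[str, Tuple[str, str]]) -> str:
    """Put each candidate's name and context into the prompt (template is illustrative)."""
    lines = [f"Mention: {mention}", "Candidates:"]
    for cid in candidates:
        name, context = concepts[cid]
        lines.append(f"- {cid}: {name} ({context})")
    lines.append("Answer with the ID of the best-matching concept.")
    return "\n".join(lines)


def link(mention: str, concepts: Dict[str, Tuple[str, str]],
         llm: Callable[[str], str], k: int = 2) -> str:
    """Stage 2: ask the LLM to pick among the retrieved candidates."""
    candidates = retrieve(mention, concepts, k)
    answer = llm(build_prompt(mention, candidates, concepts)).strip()
    # Fall back to the top retrieved candidate if the reply is not a candidate ID.
    return answer if answer in candidates else candidates[0]
```

In practice `embed` would be replaced by a real embedding model and `llm` by a call to an actual model; a stub such as `lambda prompt: "C1"` is enough to exercise the control flow.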