Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Objectives: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models struggle in non-English clinical settings, primarily because imbalanced training corpora leave them with limited clinical knowledge in those languages. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.

Materials and Methods: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct a medical knowledge base and a question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, and Baichuan2-13B on CNMLE-2022 and further investigated, from seven distinct perspectives, the effectiveness of different pathways for incorporating medical knowledge into LLMs.

Results: Applied directly, ChatGPT scored 51 on CNMLE-2022 and failed to qualify. Combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's score rose to 70.04, and GPT-4 achieved the highest score of 82.59. Both scores surpass the qualification threshold (60) and exceed the average human score of 68.70, affirming the effectiveness and robustness of the framework. KFE also enabled the smaller Baichuan2-13B to pass the examination, showcasing its potential in low-resource settings. This study sheds light on the optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios.
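
The abstract describes KFE only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: retrieve relevant passages from a knowledge base and similar solved questions from a question bank, then place both in the prompt ahead of the exam question. Everything here is an assumption made for illustration, including the names `SolvedQuestion`, `overlap`, `top_k`, and `build_prompt`, the character-bigram similarity used as a stand-in retriever, the prompt layout, and the toy data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SolvedQuestion:
    """A question-bank entry with its known answer (hypothetical schema)."""
    question: str
    answer: str


def char_bigrams(text: str) -> set[str]:
    """Character bigrams: a crude unit that needs no word segmentation."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def overlap(query: str, candidate: str) -> float:
    """Jaccard overlap of character bigrams; a stand-in for a real retriever."""
    q, c = char_bigrams(query), char_bigrams(candidate)
    return len(q & c) / len(q | c) if q and c else 0.0


def top_k(query, items, key, k):
    """Return the k items whose key text overlaps most with the query."""
    ranked = sorted(items, key=lambda item: overlap(query, key(item)), reverse=True)
    return ranked[:k]


def build_prompt(question, knowledge_base, question_bank, k_knowledge=2, k_shots=2):
    """Assemble a prompt: retrieved knowledge, then solved examples, then the question."""
    passages = top_k(question, knowledge_base, key=lambda p: p, k=k_knowledge)
    shots = top_k(question, question_bank, key=lambda s: s.question, k=k_shots)

    lines = ["Relevant knowledge:"]
    lines += [f"- {p}" for p in passages]
    lines += ["", "Solved examples:"]
    for shot in shots:
        lines += [f"Question: {shot.question}", f"Answer: {shot.answer}", ""]
    # The prompt ends with an open answer slot for the model to complete.
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper's corpus.
    kb = [
        "Metformin is a first-line drug for type 2 diabetes.",
        "Aspirin inhibits platelet aggregation.",
    ]
    bank = [
        SolvedQuestion(
            "Which drug is first-line for type 2 diabetes? A. Metformin B. Aspirin", "A"
        ),
    ]
    print(build_prompt("What is the first-line drug for type 2 diabetes?", kb, bank))
```

The resulting string would be sent to whichever LLM is under evaluation. In a realistic setting, the bigram overlap would likely be replaced by a stronger retriever, and the number of passages and examples would be tuned to the model's context window.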
