Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

$\textbf{Objectives}$: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily because imbalanced training corpora leave them with limited clinical knowledge in the respective languages. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance. $\textbf{Materials and Methods}$: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct a medical knowledge base and a question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B (BC2-7B), and BC2-13B on CNMLE-2022 and investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 perspectives. $\textbf{Results}$: Applied directly, ChatGPT scored 51 on CNMLE-2022 and failed to qualify. Combined with KFE, LLMs of varying sizes showed consistent and significant improvements: ChatGPT's score rose to 70.04, and GPT-4 achieved the highest score of 82.59. Both scores surpass the qualification threshold (60) and exceed the average human score of 68.70. KFE also enabled the smaller BC2-13B to pass the examination, showing great potential for low-resource settings. $\textbf{Conclusion}$: By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers, significantly reducing language-related disparities in LLM applications and ensuring global benefit in healthcare.
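
As a rough illustration of how a knowledge base and a question bank might be combined through in-context learning, the sketch below assembles a prompt from retrieved passages and solved example questions. It is a minimal sketch under stated assumptions, not the KFE implementation: the abstract does not specify the retriever or the prompt layout, so the character-overlap scoring, the prompt wording, and every name here (`SolvedQuestion`, `overlap_score`, `top_k`, `build_kfe_style_prompt`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SolvedQuestion:
    """One entry of a hypothetical question bank."""
    question: str
    answer: str


def overlap_score(query: str, text: str) -> float:
    """Jaccard overlap of character sets; a stand-in for a real retriever."""
    a, b = set(query), set(text)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(query, items, key, k):
    """Return the k items whose text (extracted via `key`) best matches the query."""
    ranked = sorted(items, key=lambda it: overlap_score(query, key(it)), reverse=True)
    return ranked[:k]


def build_kfe_style_prompt(question, knowledge_base, question_bank,
                           k_knowledge=2, k_shots=2):
    """Assemble an in-context prompt: knowledge passages, few-shot examples, then the question."""
    passages = top_k(question, knowledge_base, lambda p: p, k_knowledge)
    shots = top_k(question, question_bank, lambda q: q.question, k_shots)

    parts = ["Reference knowledge:"]
    parts += [f"- {p}" for p in passages]
    parts.append("Solved examples:")
    parts += [f"Q: {s.question}\nA: {s.answer}" for s in shots]
    # The target question goes last, with the answer left open for the LLM.
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)
```

The returned string would then be sent to whichever LLM is under evaluation. In practice the overlap scorer would likely be replaced by a stronger retriever, which this sketch leaves out on purpose.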