$\textbf{Objectives}$: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models struggle in non-English clinical settings, primarily because imbalanced training corpora leave them with limited clinical knowledge in the respective languages. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance. $\textbf{Materials and Methods}$: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct a medical knowledge base and a question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B (BC2-7B), and BC2-13B on CNMLE-2022 and investigated the effectiveness of different pathways for combining LLMs with medical knowledge from 7 perspectives. $\textbf{Results}$: Applied directly, ChatGPT scored 51 and failed to qualify for the CNMLE-2022. When combined with KFE, LLMs of varying sizes showed consistent and significant improvements. ChatGPT's score rose to 70.04, and GPT-4 achieved the highest score of 82.59. Both scores surpass the qualification threshold (60) and exceed the average human score of 68.70. KFE also enabled the smaller BC2-13B to pass the examination, showing its potential in low-resource settings. $\textbf{Conclusion}$: By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers, significantly reducing language-related disparities in LLM applications and ensuring global benefit in healthcare.
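The abstract describes KFE only at a high level: knowledge from a medical knowledge base and solved questions from a question bank are placed in the prompt so the model can use them in context. The following is a minimal sketch of that general idea, not the authors' implementation. The token-overlap scoring is a simple stand-in for whatever retrieval method the paper uses, and every name (`Example`, `overlap`, `top_k`, `build_prompt`), the prompt layout, and the toy data are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A solved question from a (hypothetical) question bank."""
    question: str
    answer: str


def overlap(a: str, b: str) -> int:
    """Count distinct lowercase whitespace tokens shared by two strings.

    Stand-in relevance score; a real system would likely use a
    stronger retriever.
    """
    return len(set(a.lower().split()) & set(b.lower().split()))


def top_k(query, items, key, k):
    """Return the k items whose text (via `key`) overlaps most with the query."""
    return sorted(items, key=lambda it: overlap(query, key(it)), reverse=True)[:k]


def build_prompt(question, knowledge_base, question_bank, k_knowledge=1, k_shots=1):
    """Assemble a prompt from retrieved knowledge, few-shot examples, and the question."""
    snippets = top_k(question, knowledge_base, lambda s: s, k_knowledge)
    shots = top_k(question, question_bank, lambda e: e.question, k_shots)

    parts = ["Reference knowledge:"]
    parts += [f"- {s}" for s in snippets]
    parts.append("Solved examples:")
    for e in shots:
        parts.append(f"Q: {e.question}\nA: {e.answer}")
    # The target question comes last, with the answer left open for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)


# Toy data for illustration only; not taken from the paper's corpus.
knowledge_base = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "Aspirin inhibits platelet aggregation.",
]
question_bank = [
    Example("Which drug is first-line for type 2 diabetes? A. Metformin B. Aspirin", "A"),
    Example("Which drug inhibits platelet aggregation? A. Metformin B. Aspirin", "B"),
]

prompt = build_prompt(
    "What is the first-line drug for type 2 diabetes? A. Aspirin B. Metformin",
    knowledge_base,
    question_bank,
)
print(prompt)
```

The resulting string would be sent to the LLM as-is. Keeping retrieval and prompt assembly separate makes it straightforward to vary the number of knowledge snippets and few-shot examples independently.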