Exploring the Potential of Large Language Models in Traditional Korean Medicine: A Foundation Model Approach to Culturally-Adapted Healthcare

Introduction: Traditional Korean medicine (TKM) emphasizes individualized diagnosis and treatment, which makes AI modeling difficult because of limited data and implicit processes. The large language models GPT-3.5 and GPT-4 have shown impressive medical knowledge despite lacking medicine-specific training. This study aimed to assess the capabilities of GPT-3.5 and GPT-4 for TKM using the Korean National Licensing Examination for Korean Medicine Doctors. Methods: GPT-3.5 (February 2023) and GPT-4 (March 2023) answered 340 questions from the 2022 examination across 12 subjects. Each question was independently evaluated five times, each time in an initialized session. Results: GPT-3.5 and GPT-4 achieved 42.06% and 57.29% accuracy, respectively, with GPT-4 nearing passing performance. Accuracy differed significantly by subject, with 83.75% for neuropsychiatry compared to 28.75% for internal medicine (2). Both models showed high accuracy on recall-based and diagnosis-based questions but struggled with intervention-based ones. Accuracy on questions that require TKM-specialized knowledge was lower than on questions that do not. GPT-4 showed high accuracy on table-based questions, and both models gave consistent responses. A positive correlation between consistency and accuracy was observed. Conclusion: The models in this study showed near-passing performance in decision-making for TKM without domain-specific training. However, limitations were also observed that are believed to be caused by culturally biased learning. Our study suggests that foundation models have potential in culturally adapted medicine, specifically TKM, for clinical assistance, medical education, and medical research.
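
The abstract does not spell out how accuracy and consistency were computed across the five repeated runs. As a minimal sketch, assuming accuracy is pooled over every question-run pair and consistency is the share of runs that agree with the most frequent answer (both are assumptions, not the authors' stated definitions), the bookkeeping might look like the following Python. The function names and the toy data are hypothetical and do not come from the study.

```python
from collections import Counter


def score_question(responses, correct):
    """Return (accuracy, consistency) for one question's repeated responses.

    accuracy    = fraction of runs that match the answer key
    consistency = fraction of runs that agree with the modal answer
                  (an assumed definition, for illustration only)
    """
    accuracy = sum(r == correct for r in responses) / len(responses)
    modal_count = Counter(responses).most_common(1)[0][1]
    consistency = modal_count / len(responses)
    return accuracy, consistency


def pooled_accuracy(trials, answer_key):
    """Accuracy pooled over every (question, run) pair."""
    total = sum(len(runs) for runs in trials.values())
    hits = sum(r == answer_key[q] for q, runs in trials.items() for r in runs)
    return hits / total


# Made-up toy data: three questions, five runs each (not study data).
answer_key = {"q1": 3, "q2": 1, "q3": 5}
trials = {
    "q1": [3, 3, 3, 3, 3],  # always right, fully consistent
    "q2": [1, 2, 1, 1, 4],  # mostly right, less consistent
    "q3": [2, 2, 5, 2, 2],  # consistent but mostly wrong
}

per_question = {q: score_question(trials[q], answer_key[q]) for q in trials}
overall = pooled_accuracy(trials, answer_key)
```

Keeping the two measures separate per question is what allows a correlation between consistency and accuracy to be examined afterwards. The third toy question shows why they are not the same thing: a model can be consistently wrong.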
