Introduction: Traditional Korean medicine (TKM) emphasizes individualized diagnosis and treatment, which makes AI modeling difficult because data are limited and its processes are implicit. GPT-3.5 and GPT-4, both large language models, have shown impressive medical knowledge despite lacking medicine-specific training. This study aimed to assess the capabilities of GPT-3.5 and GPT-4 for TKM using the Korean National Licensing Examination for Korean Medicine Doctors. Methods: GPT-3.5 (February 2023) and GPT-4 (March 2023) models answered 340 questions from the 2022 examination across 12 subjects. Each question was independently evaluated five times in an initialized session. Results: GPT-3.5 and GPT-4 achieved 42.06% and 57.29% accuracy, respectively, with GPT-4 nearing passing performance. Accuracy differed significantly by subject, with 83.75% accuracy for neuropsychiatry compared with 28.75% for internal medicine (2). Both models showed high accuracy on recall-based and diagnosis-based questions but struggled with intervention-based ones. Accuracy was lower for questions that require TKM-specialized knowledge than for questions that do not. GPT-4 showed high accuracy on table-based questions, and both models gave consistent responses. Consistency and accuracy were positively correlated. Conclusion: The models in this study showed near-passing performance in decision-making for TKM without domain-specific training. However, limitations were also observed, which were believed to stem from culturally biased learning. Our study suggests that foundation models have potential in culturally adapted medicine, specifically TKM, for clinical assistance, medical education, and medical research.
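The repeated-trial scoring described in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract does not define how consistency was measured, so the sketch assumes it is the share of the five responses that match the most common response. The function names and the toy answers are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter
from statistics import mean


def score_question(answers, correct):
    """Return (accuracy, consistency) for one question's repeated trials.

    accuracy: fraction of trials whose answer equals the correct choice.
    consistency (assumed definition): fraction of trials that agree with
    the most frequent answer, whether or not that answer is correct.
    """
    accuracy = sum(a == correct for a in answers) / len(answers)
    modal_count = Counter(answers).most_common(1)[0][1]
    consistency = modal_count / len(answers)
    return accuracy, consistency


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: question id -> (five model answers, correct answer).
trials = {
    "q1": (["A", "A", "A", "A", "A"], "A"),
    "q2": (["B", "C", "B", "D", "B"], "B"),
    "q3": (["E", "A", "C", "E", "D"], "C"),
}

scores = [score_question(answers, correct) for answers, correct in trials.values()]
accuracies = [s[0] for s in scores]
consistencies = [s[1] for s in scores]

overall_accuracy = mean(accuracies)  # mean over questions of per-question accuracy
correlation = pearson(accuracies, consistencies)
print(f"overall accuracy: {overall_accuracy:.2%}")
print(f"consistency-accuracy correlation: {correlation:.3f}")
```

Because every question has the same number of trials, averaging per-question accuracy is equivalent to dividing the number of correct responses by the total number of responses.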