Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology

August 21, 2023

Yixing Huang, Ahmed Gomaa, Sabine Semrau, Marlen Haderlein, Sebastian Lettmaier, Thomas Weissmann, Johanna Grigo, Hassen Ben Tkhayat, Benjamin Frey, Udo S. Gaipl, Luitpold V. Distel, Andreas Maier, Rainer Fietkau, Christoph Bert, Florian Putz

The potential of large language models in medicine for education and decision-making has been demonstrated by their decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and MedQA. In this work, we evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases. On the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieve scores of 63.65% and 74.57%, respectively, highlighting the advantage of the latest ChatGPT-4 model. The TXIT results also identify, to some extent, ChatGPT-4's strong and weak areas in radiation oncology. By ACR knowledge domain, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than of bone & soft tissue and gynecology. By clinical care path, it performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry. It lacks proficiency in the in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 is able to suggest a personalized treatment approach for each case with high correctness and comprehensiveness. Importantly, for many cases it raises novel treatment aspects that were not suggested by any of the human experts. Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as its potential to aid clinical decision-making, while acknowledging its limitations in certain domains. Because of the risk of hallucination, facts provided by ChatGPT always need to be verified.
