The potential of large language models in medicine for education and decision-making has been demonstrated by their decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and the MedQA exam. In this work, we evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases. On the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 63.65% and 74.57%, respectively, highlighting the advantage of the latest ChatGPT-4 model. The TXIT results also identify, to some extent, ChatGPT-4's strong and weak areas in radiation oncology. By ACR knowledge domain, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than of bone & soft tissue and gynecology. By clinical care path, ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry, and it lacks proficiency in the in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 is able to suggest a personalized treatment approach for each case with high correctness and comprehensiveness. Importantly, for many cases it proposes novel treatment aspects that none of the human experts suggested. Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as its potential to aid clinical decision-making, while also revealing its limitations in certain domains. Because of the risk of hallucination, facts provided by ChatGPT always need to be verified.