A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on recently released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by an experienced physicist, was used for this study. The answer options of each question were randomly shuffled to create "new" exam sets. Five LLMs -- OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet -- in the versions released before September 30, 2024, were queried using these new exam sets. To evaluate their deductive reasoning ability, the correct answer option in each question was replaced with "None of the above". Explain-first and step-by-step instruction prompts were then used to test whether this prompting strategy improved their reasoning ability. The performance of the LLMs was compared with the answers from medical physicists. Results: All models demonstrated expert-level performance on these questions, with o1-preview even surpassing medical physicists with a majority vote. When the correct answer options were replaced with "None of the above", all models exhibited a considerable decline in performance, suggesting room for improvement. The explain-first and step-by-step instruction prompts helped enhance the reasoning ability of the LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models. Conclusion: These recently released LLMs demonstrated expert-level performance in answering radiation oncology physics questions, exhibiting great potential to assist in radiation oncology physics education and training.
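
The exam-set construction described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the seeded random generator, and the choice to leave "None of the above" at the position of the removed correct option are all assumptions made for illustration, and the abstract does not specify how ties in a majority vote are handled.

```python
import random
from collections import Counter

NOTA = "None of the above"


def shuffle_options(options, correct_idx, rng):
    """Shuffle the answer options and return (shuffled, new_correct_idx).

    `options` is a list of option texts and `correct_idx` the index of the
    correct one. Both names are hypothetical, not taken from the paper.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its old index landed in `order`.
    return shuffled, order.index(correct_idx)


def replace_correct_with_nota(options, correct_idx):
    """Replace the correct option's text with "None of the above".

    Assumption: the replacement stays at the same position, so the correct
    label is unchanged and only its text differs.
    """
    replaced = list(options)
    replaced[correct_idx] = NOTA
    return replaced


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predicted, key):
    """Fraction of predicted labels that match the answer key."""
    return sum(p == k for p, k in zip(predicted, key)) / len(key)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    options = ["option 1", "option 2", "option 3", "option 4"]  # placeholder texts
    shuffled, idx = shuffle_options(options, correct_idx=2, rng=rng)
    print(shuffled, "correct index:", idx)
    print(replace_correct_with_nota(shuffled, idx))
```

Tracking the correct index through the shuffle keeps the answer key valid for each new exam set, and the same index then identifies which option text to overwrite for the "None of the above" variant.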
