A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on recently released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by an experienced physicist, was used for this study. The answer options of each question were randomly shuffled to create "new" exam sets. Five LLMs (OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet), in versions released before September 30, 2024, were queried using these new exam sets. To evaluate their deductive reasoning capabilities, the correct answer in each question was replaced with "None of the above." Explaining-first and step-by-step instruction prompts were then used to test whether this prompting strategy improved the models' reasoning capabilities. The performance of the LLMs was compared with the answers from medical physicists. Results: All models demonstrated expert-level performance on these questions, with o1-preview even surpassing medical physicists with a majority vote. When the correct answers were replaced with "None of the above," all models exhibited a considerable decline in performance, suggesting room for improvement. The explaining-first and step-by-step instruction prompts helped enhance the reasoning capabilities of the LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models. Conclusion: These recently released LLMs demonstrated expert-level performance in answering radiation oncology physics questions, showing great potential to assist in radiation oncology physics training and education.
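The two question transformations described in the Methods (shuffling the answer options and replacing the correct answer with "None of the above"), together with a majority vote over several respondents, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the representation of a question as a list of option strings plus the index of the correct option, the choice to leave "None of the above" in the position of the replaced option, and the tie-breaking rule in the vote are all assumptions made for illustration.

```python
import random
from collections import Counter

NOTA = "None of the above"


def shuffle_options(options, correct_index, rng):
    """Return a shuffled copy of the options and the new index of the correct one.

    Hypothetical helper: the abstract does not specify how shuffling was done.
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # in-place permutation of the original positions
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)


def replace_correct_with_nota(options, correct_index):
    """Replace the correct option's text with "None of the above".

    Assumption: the replacement keeps its position, so the answer key index
    is unchanged and "None of the above" becomes the correct choice.
    """
    replaced = list(options)
    replaced[correct_index] = NOTA
    return replaced, correct_index


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first (an assumption)."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffled exam set is reproducible
    # Placeholder option texts, not taken from the actual exam.
    options = ["option w", "option x", "option y", "option z"]
    shuffled, key = shuffle_options(options, correct_index=2, rng=rng)
    nota_options, nota_key = replace_correct_with_nota(shuffled, key)
    print(shuffled, key)
    print(nota_options, nota_key)
    print(majority_vote(["A", "B", "A", "C"]))
```

Seeding the random number generator makes each "new" exam set reproducible, which matters when the same shuffled set must be presented to several models.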
