The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory, driven by factors beyond medical content knowledge and reasoning capabilities. To test this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-Instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), greater than the human performance decline of 22.29%. To isolate the role of the MCQ format itself, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, free-response performance was near zero for all LLMs. Our results highlight how medical MCQ benchmarks can overestimate the capabilities of LLMs in medicine and, more broadly, point to the potential of LLM-evaluated free-response questions for improving both human and machine assessments.
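The masking study described above can be sketched as follows. This is an illustrative Python sketch only: the paper's exact masking unit (words vs. tokens), selection order, and mask symbol are assumptions here, as is the `mask_stem` helper name.

```python
import random

def mask_stem(stem: str, fraction: float, mask_token: str = "[MASKED]", seed: int = 0) -> str:
    """Replace a given fraction of the question-stem words with a mask token.

    Hypothetical sketch of iterative stem masking: at fraction=1.0 the entire
    stem is hidden, so a model answering the paired MCQ above chance must be
    relying on the answer options alone rather than the medical question.
    """
    words = stem.split()
    k = round(len(words) * fraction)            # number of words to mask
    rng = random.Random(seed)                   # fixed seed for reproducibility
    masked_idx = set(rng.sample(range(len(words)), k))
    return " ".join(mask_token if i in masked_idx else w
                    for i, w in enumerate(words))

stem = "A 45-year-old man presents with chest pain radiating to the left arm"
print(mask_stem(stem, 0.5))   # half the stem words masked
print(mask_stem(stem, 1.0))   # 100% masking: only answer options would remain
```

Sweeping `fraction` from 0 to 1 and re-scoring the model at each step separates performance attributable to the question content from performance attributable to the multiple-choice format.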