Large language models (LLMs) such as ChatGPT show considerable potential in medicine and are often evaluated using multiple-choice questions (MCQs) similar to those on the USMLE. Although MCQs are widely used in medical education, they have limitations that may be exacerbated when they are used to assess LLMs. To evaluate how well MCQs measure LLM performance, we developed a fictional medical benchmark focused on a non-existent gland, the Glianorex. Because no model can have prior knowledge of a non-existent organ, this design allowed us to separate an LLM's medical knowledge from its test-taking ability. We used GPT-4 to generate a comprehensive textbook on the Glianorex in both English and French and developed corresponding multiple-choice questions in both languages. We evaluated a range of open-source, proprietary, and domain-specific LLMs on these questions in a zero-shot setting. The models achieved average scores of around 67%, with only minor performance differences between larger and smaller models. Performance was slightly higher in English than in French. Fine-tuned medical models showed some improvement over their base versions in English but not in French. The uniformly high performance across models suggests that traditional MCQ-based benchmarks may not accurately measure LLMs' clinical knowledge and reasoning abilities and may instead reflect their pattern recognition skills. This study underscores the need for more robust evaluation methods to better assess the true capabilities of LLMs in medical contexts.