LLMs have demonstrated impressive performance in answering medical questions, for example achieving passing scores on medical licensing examinations. However, medical board exams and general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments show that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA. Datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA.