LLMs have demonstrated impressive performance in answering medical questions, for example achieving passing scores on medical licensing examinations. However, medical board exams and general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments show that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA. Datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA.