LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam questions and general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises USMLE Step 2 and Step 3 style clinical questions. Both datasets are structured as multiple-choice question-answering tasks, where each question is accompanied by an expert-written explanation. We evaluate four LLMs on the two datasets using various prompts. Experiments show that our datasets are harder than previous benchmarks. The inconsistency between automatic and human evaluations of model-generated explanations highlights the need to develop new metrics to support future research on explainable medical QA.