LLMs have demonstrated impressive performance in answering medical questions, for example by achieving passing scores on medical licensing examinations. However, medical board exam questions or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises USMLE Step 2 and Step 3 style clinical questions. Both datasets are structured as multiple-choice question-answering tasks, where each question is accompanied by an expert-written explanation. We evaluate four LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. The inconsistency between automatic and human evaluations of model-generated explanations highlights the need to develop new metrics to support future research on explainable medical QA.