Towards Expert-Level Medical Question Answering with Large Language Models

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions, with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19 percentage points and setting a new state of the art. We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
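
The size of the MedQA improvement can be checked directly from the two scores quoted in the abstract. The short sketch below computes both the absolute gain (in percentage points) and the relative gain; the variable names are illustrative and not from the paper, and reading the "over 19%" figure as an absolute difference is an inference from the two stated scores.

```python
# MedQA accuracies as quoted in the abstract (in percent).
med_palm_medqa = 67.2    # Med-PaLM
med_palm_2_medqa = 86.5  # Med-PaLM 2 (best reported, "up to")

# Absolute gain: difference in percentage points.
absolute_gain = med_palm_2_medqa - med_palm_medqa

# Relative gain: the difference as a fraction of the Med-PaLM score.
relative_gain = absolute_gain / med_palm_medqa * 100

print(f"Absolute gain: {absolute_gain:.1f} percentage points")
print(f"Relative gain: {relative_gain:.1f}%")
```

Under this reading, the absolute gain is about 19.3 percentage points, which matches the "over 19" figure, while the relative gain over Med-PaLM's score is larger, at roughly 28.7%.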
