Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz

From arXiv, 21 pages, 7 figures

Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
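The "27% reduction in error rate" is a relative figure: it compares error rates (one minus accuracy), not accuracies directly. A minimal sketch of that arithmetic follows. The helper name `relative_error_reduction` and the two input accuracies (86.5% for the prior best specialist model and 90.2% for GPT-4 with Medprompt on MedQA) are assumptions made for illustration; the abstract itself states only the 27% reduction and the above-90% score.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Return the fraction by which the error rate (1 - accuracy) shrinks.

    Both arguments are accuracies expressed as fractions in [0, 1].
    """
    if not 0.0 <= baseline_acc < 1.0:
        raise ValueError("baseline_acc must be in [0, 1)")
    if not 0.0 <= new_acc <= 1.0:
        raise ValueError("new_acc must be in [0, 1]")
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Assumed, illustrative accuracies (not given in the abstract).
prior_best = 0.865
with_medprompt = 0.902

reduction = relative_error_reduction(prior_best, with_medprompt)
print(f"Relative error reduction: {reduction:.1%}")  # prints "Relative error reduction: 27.4%"
```

Under these assumed inputs, a gain of 3.7 accuracy points corresponds to removing roughly a quarter of the remaining errors, which is why the relative figure looks much larger than the raw accuracy difference.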
