Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general-purpose LLM can be focused to deliver state-of-the-art performance in specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, in which a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Building on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1-preview's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, more challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.