m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference time. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.

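To make the two inference-time controls named in the abstract concrete, the following is a minimal sketch of a reasoning token budget combined with budget forcing. It is an illustration under stated assumptions, not the m1 implementation: the `generate` callable, its `(prompt, max_new_tokens) -> (text, tokens_used)` signature, the `"Wait "` continuation cue, and the toy generator are all hypothetical, and the 4096 default only mirrors the roughly 4K budget reported above.

```python
from typing import Callable, Tuple

# Hypothetical interface: (prompt, max_new_tokens) -> (generated_text, tokens_used).
# A real system would wrap an actual LLM call here.
GenerateFn = Callable[[str, int], Tuple[str, int]]


def budget_forced_reasoning(
    question: str,
    generate: GenerateFn,
    max_thinking_tokens: int = 4096,  # assumption: mirrors the ~4K budget in the abstract
    num_extensions: int = 1,          # how many times to prompt the model to keep thinking
    continue_cue: str = "Wait ",      # assumption: cue text used to extend reasoning
) -> str:
    """Return a reasoning trace produced under a token budget.

    After each pass the cue is appended so the model re-examines its answer,
    until the extensions or the budget run out. The cue itself is not
    counted against the budget in this sketch.
    """
    trace = ""
    remaining = max_thinking_tokens
    for round_idx in range(num_extensions + 1):
        if remaining <= 0:
            break  # budget exhausted: stop reasoning
        text, used = generate(question + trace, remaining)
        trace += text
        remaining -= used
        # Only add the cue if another reasoning pass will actually follow.
        if round_idx < num_extensions and remaining > 0:
            trace += continue_cue
    return trace


def toy_generate(prompt: str, max_new_tokens: int) -> Tuple[str, int]:
    """Stand-in for a model: emits up to three dummy 'tokens' per call."""
    n = min(3, max_new_tokens)
    return "think " * n, n
```

Calling `budget_forced_reasoning("Q: ", toy_generate, max_thinking_tokens=7, num_extensions=2)` runs three passes and stops exactly when the seven-token budget is spent. In line with the findings above, a larger `num_extensions` only buys more re-checking; it cannot supply medical knowledge the model lacks.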