Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance through overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect trajectories tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that uses the dynamics of intermediate answer confidence to decide when to terminate reasoning. It requires no additional training and integrates easily into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. It achieves a more favorable accuracy-compute tradeoff than prior early stopping methods and reduces total token usage by 25-50% relative to standard full-length reasoning. In addition, we analyze confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
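To make the idea of confidence-based early stopping concrete, the following is a minimal illustrative sketch, not the CoDE-Stop algorithm itself, whose exact stopping rule is not given in the abstract. It assumes that a confidence score in [0, 1] for the current intermediate answer is available at regular checkpoints during reasoning. The function name `early_stop_step` and the parameters `threshold`, `patience`, and `max_steps` are hypothetical choices made for this example.

```python
from typing import Optional, Sequence


def early_stop_step(
    confidences: Sequence[float],
    threshold: float = 0.9,   # hypothetical confidence level treated as "high"
    patience: int = 2,        # hypothetical number of consecutive high-confidence checkpoints
    max_steps: int = 8,       # hypothetical cap on checkpoints for unproductive traces
) -> Optional[int]:
    """Return the 0-based checkpoint index at which to stop, or None to keep reasoning.

    Illustrative rule only (an assumption, not the paper's method):
    stop once confidence has stayed at or above `threshold` for `patience`
    consecutive checkpoints, or once `max_steps` checkpoints have been seen.
    """
    streak = 0
    for step, conf in enumerate(confidences):
        # Extend the streak on a high-confidence checkpoint, otherwise reset it.
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step  # confidence has been stably high: answer now
        if step + 1 >= max_steps:
            return step  # checkpoint budget exhausted: cut the trace off
    return None  # neither condition met yet: continue reasoning


if __name__ == "__main__":
    # A trajectory that becomes confident early stops at the second checkpoint.
    print(early_stop_step([0.95, 0.97, 0.99]))
    # A trajectory that never becomes confident is cut off by the step cap.
    print(early_stop_step([0.3, 0.5, 0.4, 0.6], max_steps=3))
```

The two branches mirror the two behaviors described above: the streak condition ends correct trajectories that reach high confidence early, while the step cap bounds long, unproductive traces whose confidence never stabilizes.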