Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. Existing reinforcement learning methods for efficient reasoning still struggle to construct short reasoning paths during the rollout stage, which limits effective learning. Inspired by Evidence Accumulation Models, we find that LRMs accumulate sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. In addition, it uses a quality-controlled length reward to encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Notably, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available on GitHub.
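The two components named in the abstract, trajectory truncation during rollout and a quality-controlled length reward, can be illustrated with a minimal sketch. The abstract does not specify either rule, so everything below is an assumption made for illustration: the function names `truncate_trajectory` and `length_reward`, the fixed truncation fractions, and the linear reward shape are hypothetical and should not be read as the JET implementation.

```python
from typing import List, Sequence


def truncate_trajectory(
    tokens: List[str],
    keep_fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> List[List[str]]:
    """Return prefixes of one full reasoning trajectory.

    Hypothetical truncation rule: keep a fixed fraction of the tokens.
    Each prefix comes from the model's own rollout, so it stays
    consistent with the model's output distribution.
    """
    prefixes = []
    for frac in keep_fractions:
        cut = max(1, int(len(tokens) * frac))  # always keep at least one token
        prefixes.append(tokens[:cut])
    return prefixes


def length_reward(is_correct: bool, length: int, max_length: int) -> float:
    """Length reward gated on answer quality (assumed linear shape).

    Incorrect answers receive no reward, so brevity is never rewarded
    at the cost of correctness. Correct answers receive a base reward
    of 1.0 plus a bonus that grows as the response gets shorter.
    """
    if not is_correct:
        return 0.0
    clipped = min(length, max_length)
    return 1.0 + (1.0 - clipped / max_length)


if __name__ == "__main__":
    rollout = list("abcdefgh")  # stand-in for a tokenized reasoning path
    for prefix in truncate_trajectory(rollout):
        print(len(prefix), length_reward(True, len(prefix), len(rollout)))
```

Under these assumptions, a correct answer reached from a shorter prefix scores higher than the same answer reached from a longer one, while an incorrect answer scores zero regardless of length.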