Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chains of thought. However, this paradigm incurs substantial deployment costs, as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods that rely on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon, termed length shift, in which models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on AIME-24, our method reduces inference token usage by 78% while improving accuracy over the initial policy, surpassing state-of-the-art efficient reasoning methods.
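To make the selection rule concrete, the sketch below illustrates the group-level outlier criterion in Python. The quantile-based threshold, the function name `dot_length_mask`, and the binary-reward convention are assumptions introduced here for illustration, not the paper's exact formulation.

```python
import numpy as np

def dot_length_mask(lengths, rewards, quantile=0.9):
    """Illustrative sketch of the length-outlier selection behind DOT.

    Only when every rollout in the group is correct are the extreme-tail
    responses flagged; the threshold rule (a length quantile here) is an
    assumption, not the paper's exact criterion.
    """
    lengths = np.asarray(lengths)
    if not np.all(np.asarray(rewards) == 1.0):
        # Mixed or incorrect groups are left untouched so long-horizon
        # reasoning on hard problems is not penalized.
        return np.zeros_like(lengths, dtype=bool)
    threshold = np.quantile(lengths, quantile)
    # Responses longer than the group-level threshold are marked for
    # truncation / suppression during the policy update.
    return lengths > threshold

# Example: a fully correct group where one rollout is far longer than the rest.
mask = dot_length_mask(lengths=[320, 350, 310, 1900], rewards=[1, 1, 1, 1])
print(mask)  # [False False False  True]
```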