Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns credit at the sequence level, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive a positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted, verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group. This weakens the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks demonstrate the effectiveness of DRE.
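The credit-assignment issue described in the abstract can be illustrated with a small sketch. It assumes binary outcome rewards and a group-normalized advantage of the form (reward minus group mean) divided by group standard deviation, which is a common GRPO formulation; the use of the population standard deviation, the `eps` constant, the example group, and the names `group_advantages`, `token_credit`, `split_credit`, and `answer_idx` are all illustrative assumptions, not details from the paper. Clipping, the KL term, and length normalization are omitted.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per rollout: (r_i - mean) / (std + eps).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_credit(advantage, num_tokens):
    """Sequence-level credit: every token in the rollout gets the same advantage."""
    return [advantage] * num_tokens


def split_credit(credit, answer_idx):
    """Split per-token credit at the (hypothetical) token index where the
    correct answer first emerges: (solution-reaching prefix, continuation)."""
    return credit[:answer_idx], credit[answer_idx:]


# Hypothetical group of four rollouts for one prompt.
rewards = [1.0, 1.0, 0.0, 0.0]      # two successful, two unsuccessful
lengths = [120, 300, 150, 140]      # rollout lengths in tokens
answer_idx = [100, 110, None, None]  # answer emergence; None if never reached

advantages = group_advantages(rewards)

# The second rollout reaches the answer at token 110 but runs to 300 tokens.
credit = token_credit(advantages[1], lengths[1])
prefix_credit, tail_credit = split_credit(credit, answer_idx[1])

# The 190 post-answer tokens receive exactly the same positive credit as the
# 110 tokens that were needed to reach the answer.
print(len(prefix_credit), len(tail_credit), prefix_credit[0] == tail_credit[0])
```

In this sketch nothing in the update distinguishes the prefix from the continuation, which is the feedback loop the abstract describes. A training-time intervention such as DRE acts on this gap by changing which trajectories are compared within the group rather than by changing the decoding procedure.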