Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet they often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and also degrade accuracy, because they treat all reasoning steps uniformly and lack fine-grained signals to distinguish redundant steps from necessary ones. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. Using the attention scores of these heads, we then employ two sub-strategies that mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
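The idea of attention-guided, step-level credit assignment can be illustrated with a small sketch. This is not ATTNPO's actual formulation: the abstract does not specify the two sub-strategies, so the function `step_penalties`, its parameters, and the normalization rule below are all hypothetical. The sketch assumes each reasoning step already has a non-negative attention score aggregated from the selected heads, and it simply splits a trajectory-level length penalty so that low-attention steps (assumed redundant) absorb more of it than high-attention steps (assumed essential).

```python
from typing import List


def step_penalties(attn_scores: List[float], base_penalty: float) -> List[float]:
    """Split a trajectory-level length penalty across reasoning steps.

    Hypothetical illustration only. `attn_scores[i]` is assumed to be a
    non-negative attention score for step i, aggregated from the selected
    heads. Low-scoring steps receive a larger share of `base_penalty`.
    """
    if not attn_scores:
        return []
    if any(s < 0 for s in attn_scores):
        raise ValueError("attention scores must be non-negative")

    n = len(attn_scores)
    top = max(attn_scores)
    if top == 0:
        # No signal at all: fall back to a uniform penalty.
        return [base_penalty / n] * n

    # Redundancy weight in [0, 1]: 0 for the most-attended step.
    redundancy = [1.0 - s / top for s in attn_scores]
    total = sum(redundancy)
    if total == 0:
        # All steps equally attended: fall back to a uniform penalty.
        return [base_penalty / n] * n

    # Shares sum to base_penalty, so the overall penalty scale is unchanged.
    return [base_penalty * r / total for r in redundancy]


if __name__ == "__main__":
    # Three steps; the second is attended least, so it is penalized most.
    print(step_penalties([0.5, 0.1, 0.4], base_penalty=1.0))
```

Keeping the shares summed to the original penalty is a design choice of this sketch: it changes only how the penalty is distributed across steps, not its total magnitude.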