Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets, and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error, concatenated with the intervention, localizing the error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
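The InT data-construction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in InT the error check and the intervention are produced by the model itself against a reference solution, whereas here they are stand-in callables (`reference_check`, `propose_intervention`) operating on a toy trace of arithmetic steps.

```python
# Minimal sketch of InT data construction (hypothetical helper names; in the
# paper the verifier and intervention proposer are the LLM itself, guided by
# a reference solution).

def first_error_index(trace, reference_check):
    """Return the index of the first incorrect step, or None if all steps pass.

    `reference_check(step)` stands in for verifying one step against the
    reference solution -- easier than generating a correct solution from scratch.
    """
    for i, step in enumerate(trace):
        if not reference_check(step):
            return i
    return None

def build_int_example(trace, reference_check, propose_intervention):
    """Build an SFT target: the on-policy prefix up to the first error,
    concatenated with a single corrective intervention step."""
    i = first_error_index(trace, reference_check)
    if i is None:
        return None  # trace is fully correct; no intervention needed
    prefix = trace[:i]                    # keep the valid on-policy steps
    fix = propose_intervention(trace[i])  # one targeted correction
    return prefix + [fix]

# Toy usage: each step is an (expression, claimed_value) pair; a step is
# "correct" if the claimed value matches Python's evaluation of the expression.
trace = [("2+3", 5), ("5*4", 20), ("20-7", 12), ("12/3", 4)]
check = lambda step: eval(step[0]) == step[1]
fix = lambda step: (step[0], eval(step[0]))  # stand-in intervention: recompute
sft_target = build_int_example(trace, check, fix)
# The first error is step 2 ("20-7" != 12); the SFT target keeps the two
# correct steps and replaces the faulty one with the corrected step.
```

Training then applies standard SFT loss to such prefix-plus-intervention sequences, so the gradient localizes credit to the step that caused the failure rather than to the whole trace.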