Contemporary methods for reinforcement learning with verifiable rewards post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Better credit assignment addresses this limitation by enabling targeted refinement of faulty reasoning steps rather than uniform updates to entire trajectories. Resets are a simple mechanism for this: returning to an intermediate state and resampling counterfactual continuations lets outcome differences be attributed to decisions made at that point. We propose two reset-based methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework and show that extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO. It samples multiple suffix continuations at a self-localized reset and learns from their rewards, using only the model itself with no external supervision.
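To make the reset mechanism concrete, the following is a minimal sketch of resampling continuations from a shared prefix and scoring only the resampled suffixes. It is not the authors' implementation. The names `reset_advantages`, `random_reset_index`, `sample_suffix`, and `reward_fn` are hypothetical, the group-relative normalization (reward minus group mean, divided by group standard deviation) is an assumed GRPO-style choice, and the uniform choice of reset index is an assumption about how RRPO draws reset states.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Step = str  # one reasoning step, represented as text for illustration


def random_reset_index(num_steps: int, rng: random.Random) -> int:
    """Pick a reset step uniformly at random (assumed RRPO-style choice)."""
    return rng.randrange(num_steps)


def reset_advantages(
    prefix: Sequence[Step],
    sample_suffix: Callable[[Sequence[Step]], List[Step]],
    reward_fn: Callable[[Sequence[Step]], float],
    num_samples: int = 4,
) -> List[Tuple[List[Step], float]]:
    """Resample suffixes from a shared prefix and score them against each other.

    Because every continuation shares the same prefix, differences in outcome
    reward are attributable to the steps taken after the reset point.
    """
    suffixes = [sample_suffix(prefix) for _ in range(num_samples)]
    rewards = [reward_fn(list(prefix) + suffix) for suffix in suffixes]

    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)

    # Assumed group-relative normalization; if all rewards are equal there is
    # no learning signal, so every advantage is set to zero.
    advantages = [(r - mean) / std if std > 0 else 0.0 for r in rewards]
    return list(zip(suffixes, advantages))
```

In this sketch the advantages would be applied only to the suffix tokens, leaving the shared prefix untouched. A self-reset variant in the spirit of SRPO would replace `random_reset_index` with a call in which the model itself names the erroneous step; that call is not shown because the abstract does not specify its form.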