Credit Assignment with Resets in Language Model Reasoning

Contemporary reinforcement learning with verifiable rewards (RLVR) methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly to all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Better credit assignment addresses this limitation by enabling targeted refinement of faulty reasoning steps rather than uniform updates over entire trajectories. Resets are one simple mechanism for this: returning to an intermediate state and resampling counterfactual continuations lets outcome differences be attributed to decisions made at that point. We propose two reset-based methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO. It does so by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself, with no external supervision.
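
The reset idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `sample_suffix` and `reward_fn` callables, and the mean/std normalization of rewards (a GRPO-style group baseline) are all assumptions made for illustration. The sketch keeps a prefix of reasoning steps fixed, resamples several suffixes from the reset point, and scores each suffix relative to the group, so that reward differences are attributable to what happens after the reset.

```python
import random
from statistics import mean, pstdev


def random_reset_index(steps, rng=random):
    """Random-reset variant: pick a reset point uniformly over the steps.
    (A self-reset variant would instead ask the model for this index.)"""
    return rng.randrange(len(steps))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within the group of resampled suffixes.
    Assumed GRPO-style baseline; eps avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reset_and_resample(steps, reset_index, sample_suffix, reward_fn, k=4):
    """Keep steps[:reset_index] fixed and sample k counterfactual suffixes.

    sample_suffix(prefix) -> list of steps (hypothetical policy sampler)
    reward_fn(trajectory) -> float (hypothetical verifiable outcome reward)
    """
    prefix = steps[:reset_index]
    suffixes = [sample_suffix(prefix) for _ in range(k)]
    rewards = [reward_fn(prefix + suffix) for suffix in suffixes]
    # Only the suffix tokens would receive these advantages; the shared
    # prefix is identical across samples and carries no relative signal.
    return prefix, suffixes, group_relative_advantages(rewards)
```

In this sketch the prefix is shared by every sample, so any difference in reward comes from the resampled suffix, which is the sense in which a reset localizes credit.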

