Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision has a critical limitation: it penalizes trajectories that are largely correct but fail because of a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, which reduces rollout diversity and prematurely narrows the exploration space. Although Process Reward Models (PRMs) have proven effective at providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods introduce off-policy guided whole-trajectory replacement, but the replacement trajectories often fall outside the policy model's distribution, and these methods still fail to use the largely correct rollouts generated by the model itself; as a result, they do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that uses PRMs to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By precisely refining partially correct rollouts, our method salvages them and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
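To make the core idea concrete, the following is a minimal sketch of how a first erroneous step might be located and rectified. It is an illustration under stated assumptions, not the SCOPE implementation: the per-step PRM scores, the 0.5 threshold, the function names, and the `correct_step` callback (standing in for whatever off-policy source supplies the correction) are all hypothetical, and how the rollout continues after the corrected step is left unspecified here.

```python
from typing import Callable, List, Optional, Sequence


def first_error_index(step_scores: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`.

    `step_scores` are assumed to be per-step correctness scores from a
    Process Reward Model; the threshold value is an assumption.
    Returns None when every step passes.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def rectify_first_error(steps: Sequence[str],
                        step_scores: Sequence[float],
                        correct_step: Callable[[List[str], str], str],
                        threshold: float = 0.5) -> List[str]:
    """Keep the on-policy prefix and replace only the first erroneous step.

    `correct_step(prefix, bad_step)` is a hypothetical off-policy corrector
    (for example, a stronger model or a reference solution lookup).
    """
    idx = first_error_index(step_scores, threshold)
    if idx is None:
        # No step flagged: the rollout is returned unchanged.
        return list(steps)
    prefix = list(steps[:idx])  # steps the policy itself got right
    return prefix + [correct_step(prefix, steps[idx])]
```

In this sketch, the steps before the first error stay on-policy and only the flagged step is replaced, which contrasts with whole-trajectory replacement, where the entire rollout would be swapped for an off-policy one.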
