Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The conflicting objectives of generating stable inference responses and diverse training trajectories lead to insufficient exploration, which harms reasoning capability. To address this problem, we propose R$^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.4% on MATH-500 and 1.3% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO-ARR/Code.
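The decoupling idea in the abstract can be illustrated with a minimal sketch: a small residual head perturbs the base policy's logits only when generating training rollouts, while inference uses the unmodified base logits. This is an assumption-laden illustration (the class name `ResidualRolloutHead`, the linear form of the residual, and all sizes are hypothetical), not the paper's actual implementation.

```python
import numpy as np

class ResidualRolloutHead:
    """Hypothetical sketch of a residual rollout head.

    During training rollouts, a small learned linear correction is added
    to the base policy's logits to diversify trajectories; at inference,
    the base logits pass through unchanged, keeping generation stable.
    """

    def __init__(self, hidden_size: int, vocab_size: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Small random residual projection (illustrative initialization).
        self.W = 0.01 * rng.standard_normal((hidden_size, vocab_size))
        self.b = np.zeros(vocab_size)

    def __call__(self, hidden: np.ndarray, base_logits: np.ndarray,
                 training_rollout: bool) -> np.ndarray:
        if training_rollout:
            # Residual correction applied only on the training path.
            return base_logits + hidden @ self.W + self.b
        # Inference path: base policy logits are returned untouched.
        return base_logits
```

The key design point sketched here is that the residual is additive and gated by the rollout mode, so removing or freezing the head recovers the original policy exactly at inference time.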