Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each update into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism that controls off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average across math reasoning benchmarks, and it reaches GRPO's best accuracy with up to 4.93\texttimes{} fewer rollouts and up to a 4.19\texttimes{} reduction in wall-clock time. The project website is available at https://slow-fast-po.github.io/.
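The three-stage update described above can be sketched in a minimal form. This is an illustrative sketch, not the paper's algorithm: the gradient is a toy quadratic surrogate standing in for the policy-gradient estimate on a rollout batch, and the hyperparameters (`inner_steps`, the reposition coefficient `alpha`, and `lr`) are assumptions chosen for the example.

```python
import numpy as np

def grad(theta, batch):
    # Toy surrogate: gradient of the quadratic loss ||theta - mean(batch)||^2.
    # In SFPO this would be the policy-gradient estimate on the rollout batch.
    return 2.0 * (theta - batch.mean(axis=0))

def sfpo_step(theta_slow, batch, lr=0.1, inner_steps=3, alpha=0.5):
    """One slow-fast step: fast inner trajectory -> reposition -> slow correction.

    `inner_steps`, `alpha` (reposition strength), and `lr` are illustrative
    hyperparameters, not values from the paper.
    """
    # Stage 1: short fast trajectory of inner steps on the same batch.
    theta_fast = theta_slow.copy()
    for _ in range(inner_steps):
        theta_fast -= lr * grad(theta_fast, batch)

    # Stage 2: reposition toward the slow iterate to control off-policy drift.
    theta_repo = theta_slow + alpha * (theta_fast - theta_slow)

    # Stage 3: final slow correction applied from the repositioned point.
    return theta_repo - lr * grad(theta_repo, batch)

rng = np.random.default_rng(0)
batch = rng.normal(loc=1.0, size=(32, 4))  # stand-in for one rollout batch
theta = np.zeros(4)
for _ in range(20):
    theta = sfpo_step(theta, batch)
```

Because the objective and rollout process are untouched, the sketch only changes how each gradient step is taken: any existing policy-gradient loop could, in principle, swap its single update for `sfpo_step`.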