Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, which conditions on the prefix of a successful off-policy trace and runs on-policy RL to complete it, sidestepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating problem difficulty via the length of the off-policy prefix. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample-efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL remains effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
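The core rollout construction can be sketched in a few lines. This is a minimal toy sketch, not the paper's implementation: the helper names (`make_prefixed_problem`, `curriculum_fracs`) and the token-list representation are illustrative assumptions, and in practice the prefix would be prepended to the prompt of an LLM policy, with RL gradients taken only over the tokens the policy generates itself.

```python
import random

def make_prefixed_problem(problem, off_policy_trace, frac):
    """Condition on the first `frac` fraction of a successful off-policy
    trace; the policy must complete the remainder on-policy.
    (Hypothetical helper, for illustration only.)"""
    cut = int(len(off_policy_trace) * frac)
    prefix = off_policy_trace[:cut]
    # The prompt seen by the policy is the problem plus the prefix.
    # Only the on-policy completion tokens would receive policy gradients.
    return problem + prefix, prefix

def curriculum_fracs(num_rollouts, hard=True):
    """Longer prefixes make a problem easier, shorter ones harder;
    sampling a range of prefix lengths modulates difficulty."""
    lo, hi = (0.5, 0.9) if hard else (0.0, 0.5)
    return [random.uniform(lo, hi) for _ in range(num_rollouts)]

# Usage: a successful trace (here, a character-token list) sourced by
# rejection sampling from the base model.
trace = list("step1;step2;step3;answer=42")
prompt, prefix = make_prefixed_problem(list("Q: ..."), trace, 0.5)
assert prompt[len("Q: ..."):] == prefix
```

The design choice this illustrates is that off-policy data enters only through the conditioning context, never as a supervision target, which is what sidesteps the instabilities of standard off-policy updates.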