Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and thus no learning signal to drive improvement. We find that natural remedies to this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve the issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human-written or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
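
The prefix-augmentation step can be illustrated with a minimal sketch. All names below (`prefix_augment`, the prompt template, the prefix fractions) are hypothetical and chosen for illustration; they are not from the paper's implementation. The idea is simply that each hard problem is paired with guided variants whose prompt includes a growing prefix of the oracle solution, so on-policy rollouts have a chance of reaching non-zero reward:

```python
# Hypothetical sketch of POPE-style prefix augmentation.
# Assumption: oracle solutions are available as ordered lists of steps.

def prefix_augment(problem: str, oracle_steps: list[str],
                   fractions=(0.25, 0.5, 0.75)) -> list[str]:
    """Build guided prompts containing increasing prefixes of an oracle solution.

    Each returned prompt embeds the first k oracle steps (k grows with the
    fraction), leaving the remainder for the policy to complete on-policy.
    """
    variants = []
    for f in fractions:
        # Keep at least one oracle step in every guided variant.
        k = max(1, int(f * len(oracle_steps)))
        hint = "\n".join(oracle_steps[:k])
        variants.append(
            f"{problem}\n\nPartial solution (continue from here):\n{hint}\n"
        )
    return variants

# Toy example: a hard problem with a three-step oracle solution.
prompts = prefix_augment(
    "Prove that the sum of two even integers is even.",
    ["Let a = 2m and b = 2n for integers m, n.",
     "Then a + b = 2m + 2n = 2(m + n).",
     "Since m + n is an integer, a + b is even."],
)
```

During training, rollouts from these guided prompts earn reward when the completion is correct; the abstract's key claim is that the behaviors learned this way transfer back to the unguided problem.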