Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls once problems become saturated, i.e., once the model solves them on nearly every rollout. We identify the core challenge as the poor accessibility of informative failures: learning signals still exist, but the rare incorrect trajectories that carry them are seldom encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative variant, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway for extending RLVR training on saturated problems.
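To make the procedure concrete, the minimal Python sketch below shows one plausible instantiation of the prefix-collection step, assuming a simple interface: the `Problem` type, the `sample_rollout` callable, and the `num_rollouts` and `prefix_frac` parameters are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    prompt: str
    # Verifiable reward: True iff the trace ends in a correct answer (assumed interface).
    check_answer: Callable[[str], bool]


def collect_failure_prefixes(
    problem: Problem,
    sample_rollout: Callable[[str], str],  # policy rollout: prompt -> full reasoning trace
    num_rollouts: int = 64,
    prefix_frac: float = 0.5,              # fraction of the failed trace kept as a prefix
) -> List[str]:
    """Roll out on a saturated problem and keep prefixes of the rare
    incorrect trajectories; these become the new starting states."""
    prefixes: List[str] = []
    for _ in range(num_rollouts):
        trace = sample_rollout(problem.prompt)
        if not problem.check_answer(trace):          # reward 0: an informative failure
            tokens = trace.split()                   # whitespace split, for illustration only
            cut = max(1, int(len(tokens) * prefix_frac))
            prefixes.append(" ".join(tokens[:cut]))
    return prefixes


def conditioned_prompts(problem: Problem, prefixes: List[str]) -> List[str]:
    """RLVR episodes then start from question + failure prefix rather than the
    bare question, steering exploration toward failure-prone states."""
    return [f"{problem.prompt}\n{p}" for p in prefixes]
```

Under this sketch, the iterative variant described above would simply rerun `collect_failure_prefixes` with the current policy whenever performance plateaus, refreshing the set of conditioning prefixes.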