Reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When the data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress resumes. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder problems become tractable, yielding steady, continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and it suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically through synthetic experiments.
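The relay-versus-plateau contrast can be made concrete with a minimal toy simulation; the sketch below is our own illustrative assumption, not the paper's model or experiments. It represents competence as a single scalar c, assumes a problem of difficulty d is solved with probability sigmoid(c - d), and lets the outcome-reward gradient scale with the reward variance p(1 - p), so that learning signal exists only near the edge of competence. Under these assumptions, a smooth difficulty mixture produces steady growth while a gapped mixture stalls on a long plateau.

```python
import numpy as np

# Toy sketch (illustrative assumptions only, not the paper's construction):
# competence is a scalar c, and a problem of difficulty d is solved with
# probability sigmoid(c - d). With outcome-only (RLVR-style) rewards, the
# policy-gradient signal is taken to scale with the reward variance
# p * (1 - p), which vanishes on problems far above or below the model's
# current competence.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(difficulties, steps=5000, lr=0.5, seed=0):
    """Simulate capability growth under a given difficulty mixture."""
    rng = np.random.default_rng(seed)
    c, trace = 0.0, []
    for _ in range(steps):
        d = rng.choice(difficulties)      # sample a problem from the mixture
        p = sigmoid(c - d)                # current solve rate on that problem
        c += lr * p * (1.0 - p)           # signal only at the edge of competence
        trace.append(c)
    return np.array(trace)

smooth = np.linspace(0.0, 20.0, 41)   # smooth spectrum: difficulties 0, 0.5, ..., 20
gapped = np.array([0.0, 20.0])        # abrupt discontinuity: easy and hard only

# Expected behavior: the smooth mixture "relays" capability past the hardest
# difficulty, while the gapped mixture plateaus well below it.
print("smooth spectrum, final capability:", train(smooth)[-1])
print("gapped spectrum, final capability:", train(gapped)[-1])
```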