Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
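To make the central quantity concrete, the following is a minimal sketch in a toy softmax bandit. It is not the paper's experimental setup. It assumes that the Gradient Gap is the difference between the mean score function $\nabla \log \pi$ over rewarded responses and the mean over unrewarded responses; the paper's exact definition may differ. All function names, logits, and rewards are hypothetical and chosen for illustration only. Under that assumption, the exact policy gradient of the success rate $p$ equals $p(1-p)$ times the gap, which the sketch checks numerically.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(probs, k):
    # Gradient of log pi(k) with respect to the logits of a softmax policy.
    return [(1.0 if i == k else 0.0) - p for i, p in enumerate(probs)]

def conditional_mean_score(probs, arms):
    # Mean score function over `arms`, weighted by the policy restricted to them.
    mass = sum(probs[k] for k in arms)
    out = [0.0] * len(probs)
    for k in arms:
        s = score(probs, k)
        for i in range(len(out)):
            out[i] += (probs[k] / mass) * s[i]
    return out

def gradient_gap(probs, rewards):
    # ASSUMED definition: mean score on rewarded arms minus mean score on the rest.
    good = [k for k, r in enumerate(rewards) if r == 1]
    bad = [k for k, r in enumerate(rewards) if r == 0]
    g_plus = conditional_mean_score(probs, good)
    g_minus = conditional_mean_score(probs, bad)
    return [a - b for a, b in zip(g_plus, g_minus)]

def success_rate(probs, rewards):
    # Probability that a sampled response receives reward 1.
    return sum(p * r for p, r in zip(probs, rewards))

def policy_gradient(probs, rewards):
    # Exact (expected) REINFORCE gradient of the success rate.
    grad = [0.0] * len(probs)
    for k, r in enumerate(rewards):
        s = score(probs, k)
        for i in range(len(grad)):
            grad[i] += probs[k] * r * s[i]
    return grad

# Hypothetical four-response bandit with binary, verifiable rewards.
logits = [0.2, -0.5, 1.0, 0.0]
rewards = [1, 0, 0, 1]

probs = softmax(logits)
p = success_rate(probs, rewards)
gap = gradient_gap(probs, rewards)
pg = policy_gradient(probs, rewards)
scaled_gap = [p * (1.0 - p) * g for g in gap]

# One small step along the gap direction (step size 0.5 is arbitrary).
stepped_logits = [z + 0.5 * g for z, g in zip(logits, gap)]
p_after = success_rate(softmax(stepped_logits), rewards)

print("policy gradient :", pg)
print("p(1-p) * gap    :", scaled_gap)
print("success rate    :", p, "->", p_after)
```

In this toy setting the two printed vectors agree, and a small step along the gap raises the success rate. The sketch does not reproduce the step-size threshold or the collapse behavior described in the abstract, both of which depend on the paper's specific analysis.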