Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has achieved significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory applies to any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language-model experiments that post-train Qwen2.5-Math-7B with GRPO.
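The abstract leaves the Gradient Gap's formal definition to the body of the paper; the following is a minimal schematic consistent with the description above, assuming a binary verifiable reward $r(y) \in \{0,1\}$ and a policy $\pi_\theta$ over responses $y$ (the conditional-expectation form and the symbols $\Delta$, $u_t$, $\eta$, $p$ are our notation, not necessarily the paper's):
$$
\Delta(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(y) \;\middle|\; r(y) = 1\right] \;-\; \mathbb{E}_{y \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(y) \;\middle|\; r(y) = 0\right].
$$
In this notation, the abstract's two claims read as: (i) an update direction $u_t$ makes progress only if $\langle u_t, \Delta(\theta_t) \rangle > 0$, and (ii) stability requires a step size $\eta \lesssim c / \lVert \Delta(\theta_t) \rVert$ for some problem-dependent constant $c$. One consequence of this schematic definition, which we can verify directly: for REINFORCE with a mean-reward baseline, the expected update is exactly $p(1-p)\,\Delta(\theta)$, where $p$ is the current success rate, which previews why progress can slow at a fixed learning rate as $p$ approaches either extreme.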
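The proportionality above is easy to check numerically. Below is a minimal sketch in the spirit of the paper's controlled bandit simulations; the tabular softmax policy, arm count, sample size, and all variable names are our illustrative assumptions, not the paper's setup. It samples from a $K$-armed bandit with a single rewarded arm, forms the empirical Gradient Gap as the mean score vector over successes minus that over failures, and checks that the REINFORCE-with-baseline gradient estimate equals $p(1-p)\,\Delta$ on the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 8                         # hypothetical bandit: K candidate responses
correct = 3                   # arm 3 is the only response with reward 1
logits = rng.normal(size=K)   # a fixed tabular softmax policy, for illustration
pi = np.exp(logits - logits.max())
pi /= pi.sum()

N = 200_000
arms = rng.choice(K, size=N, p=pi)
rewards = (arms == correct).astype(float)

# Score function of a tabular softmax: grad_theta log pi(a) = onehot(a) - pi.
scores = -np.tile(pi, (N, 1))
scores[np.arange(N), arms] += 1.0

p_bar = rewards.mean()  # empirical success rate

# Empirical Gradient Gap: mean score over successes minus mean over failures.
gap = scores[rewards == 1].mean(axis=0) - scores[rewards == 0].mean(axis=0)

# REINFORCE-with-baseline gradient estimate: mean of (r - p_bar) * score.
pg = ((rewards - p_bar)[:, None] * scores).mean(axis=0)

cos = pg @ gap / (np.linalg.norm(pg) * np.linalg.norm(gap))
ratio = np.linalg.norm(pg) / (p_bar * (1 - p_bar) * np.linalg.norm(gap))
print(f"success rate p_bar              = {p_bar:.3f}")
print(f"cosine(policy grad, gap)        = {cos:.6f}")    # 1.0 up to float error
print(f"|pg| / (p_bar(1-p_bar)|gap|)    = {ratio:.6f}")  # 1.0 up to float error
```

The identity holds exactly on the samples, not just in expectation, which is why both printed quantities come out as $1.0$ up to floating-point error. A tabular softmax keeps the score function in closed form ($\nabla_\theta \log \pi_\theta(a) = e_a - \pi_\theta$), so the check isolates the estimator algebra from any model architecture.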