Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard: it avoids the need for a learned critic by using per-prompt baselines together with standard-deviation normalization. Yet why and when this normalization helps remains unclear. In this work, we provide an explanation through the lens of the local curvature of the sequence-level policy gradient: standard-deviation normalization implements an adaptive gradient step. Theoretically, under mild conditions, GRPO enjoys a strictly improved convergence rate over unnormalized REINFORCE, with the gain characterized by the within-prompt reward standard deviation averaged across prompts and iterations. Empirically, our analysis on the GSM8K and MATH benchmarks reveals three distinct training phases governed by the interplay between feature orthogonality and reward variance: (I) an early acceleration phase, where high variance and orthogonality favor adaptive scaling; (II) a relatively stable transition phase; and (III) a late-stage regime, where the loss of orthogonality limits further gains. Together, these results provide a principled account of when standard-deviation normalization helps in GRPO, and offer broader insights into the design of critic-free RL algorithms.
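To make the mechanism concrete, the sketch below (our illustration, not code from the paper) contrasts GRPO's group-normalized advantages with an unnormalized REINFORCE-with-baseline variant; the function names and the `eps` stabilizer are our own choices, assuming the standard per-group mean/std formulation of GRPO.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt (illustrative sketch).

    `rewards` holds scalar rewards of a group of responses sampled for the
    same prompt. GRPO subtracts the per-prompt mean (the critic-free
    baseline) and divides by the per-prompt standard deviation; the latter
    is the std normalization analyzed here as an adaptive rescaling of the
    sequence-level policy gradient. `eps` is an assumed numerical guard.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    std = rewards.std()
    return (rewards - baseline) / (std + eps)

def reinforce_advantages(rewards):
    """Unnormalized REINFORCE-with-baseline advantages (no std scaling)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return rewards - rewards.mean()

# Example: with high within-group reward variance, GRPO shrinks each
# response's gradient weight relative to plain REINFORCE; with low
# variance, it amplifies it.
if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    print("GRPO:     ", grpo_advantages(group_rewards))
    print("REINFORCE:", reinforce_advantages(group_rewards))
```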