Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) with policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline) that reduces the variance of the policy-gradient estimator. In practice, each prompt's mean reward is estimated by averaging the rewards of the generations sampled for that prompt within a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to estimate each prompt's mean reward more accurately, especially in the regime of few generations per prompt typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our baseline is a drop-in replacement for standard per-prompt mean baselines and requires no additional hyperparameters or computation. Empirically, shrinkage baselines consistently outperform empirical-mean baselines, yielding lower-variance gradient updates and improved training stability.
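The abstract does not specify the exact form of the shrinkage estimator, but the underlying idea can be illustrated with a minimal James-Stein-style sketch: each per-prompt empirical mean is shrunk toward the across-prompt grand mean by a data-driven factor. The function name `shrinkage_baselines`, the positive-part shrinkage rule, and the use of the average per-prompt sampling variance below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def shrinkage_baselines(rewards):
    """
    Illustrative James-Stein-style shrinkage of per-prompt mean rewards
    toward the across-prompt grand mean (hypothetical sketch, not the
    paper's estimator). `rewards` has shape (num_prompts, num_generations)
    and holds trajectory rewards for one batch.
    """
    P, G = rewards.shape
    prompt_means = rewards.mean(axis=1)           # per-prompt empirical means
    grand_mean = prompt_means.mean()              # across-prompt mean
    # Sampling variance of each per-prompt mean (sigma^2 / G), estimated
    # from that prompt's G generations; averaged across prompts as an
    # approximation since the classical James-Stein rule assumes a common variance.
    sampling_var = (rewards.var(axis=1, ddof=1) / G).mean()
    # Positive-part James-Stein shrinkage factor toward the grand mean.
    dispersion = np.sum((prompt_means - grand_mean) ** 2)
    shrink = 1.0 - (P - 3) * sampling_var / max(dispersion, 1e-12)
    shrink = np.clip(shrink, 0.0, 1.0)
    # Shrunken baselines: convex combination of per-prompt and across-prompt means.
    return grand_mean + shrink * (prompt_means - grand_mean)

# Usage sketch: subtract the shrunken baseline from each trajectory's reward
# to form centered advantages, as in per-prompt mean-baseline methods.
rewards = np.random.binomial(1, 0.6, size=(8, 4)).astype(float)  # 8 prompts, 4 generations each
baselines = shrinkage_baselines(rewards)
advantages = rewards - baselines[:, None]
```

When the per-prompt means are spread far apart relative to their sampling noise, the factor approaches 1 and the baseline reduces to the standard per-prompt empirical mean; when the batch provides only a few noisy generations per prompt, the baseline borrows strength from the other prompts in the batch.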