Group-based reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. These findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
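As an informal illustration of point (ii), the minimal Python sketch below (not taken from the paper; the helper names `group_relative_advantages` and `adam_updates` are ours) demonstrates two well-known scale-cancellation effects: the group-normalized advantage used in GRPO is unchanged when all rewards in a group are multiplied by a positive constant, and Adam's second-moment normalization makes the per-parameter update nearly invariant to a uniform rescaling of the gradients (up to the eps term; AdamW's decoupled weight decay acts on the parameters rather than the gradients and is therefore unaffected).

```python
# Minimal sketch, not the paper's code: two places where reward scale cancels.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean reward and divide
    by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def adam_updates(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Sequence of Adam parameter updates for a scalar gradient stream
    (weight decay omitted, since in AdamW it does not touch the gradients)."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rewards = [0.2, 0.9, 0.4, 0.7]            # one group of sampled completions
print(group_relative_advantages(rewards))
print(group_relative_advantages([10 * r for r in rewards]))  # same up to eps

grads = np.array([0.3, -0.1, 0.25, 0.05])
print(adam_updates(grads))
print(adam_updates(100 * grads))          # nearly identical: gradient scale cancels
```

Under these assumptions, rescaling every reward by a positive constant rescales the surrogate gradient by the same constant, which the second-moment normalization then cancels; this is only a toy demonstration of the effect, not the paper's formal argument.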