Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards while retaining PPO-style token-level importance sampling based on an old policy. Our theoretical analysis reveals that the GRPO update rule estimates the policy gradient at the old policy rather than the current one; however, since the old policy is refreshed every few steps, the resulting discrepancy remains small and the induced bias is negligible in practice. To empirically validate this insight, we conduct an ablation study that entirely removes importance sampling and performs multiple optimization steps using gradients estimated at a fixed old policy. Remarkably, this simplified variant attains performance comparable to standard GRPO. Motivated by this finding, we propose Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first convergence analysis for GRPO-style methods and show that TIC-GRPO converges faster than GRPO. Finally, empirical results across math reasoning and coding tasks demonstrate the superiority of TIC-GRPO.
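The two ingredients contrasted above — GRPO's group-normalized advantages with per-token importance ratios, versus TIC-GRPO's single trajectory-level probability ratio — can be sketched in a few lines. This is a minimal illustration assuming the standard group-normalization form (reward minus group mean, divided by group standard deviation); the function names and the epsilon constant are illustrative, not the paper's implementation.

```python
import math

def group_normalized_advantages(rewards):
    """GRPO replaces the PPO critic with group-normalized rewards:
    A_i = (r_i - mean(r)) / std(r) over the G responses in one group."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def token_level_ratios(logp_current, logp_old):
    """GRPO-style weighting: one importance ratio per token of a response,
    r_t = pi_theta(y_t | x, y_<t) / pi_old(y_t | x, y_<t)."""
    return [math.exp(c - o) for c, o in zip(logp_current, logp_old)]

def trajectory_level_ratio(logp_current, logp_old):
    """TIC-GRPO-style weighting: a single probability ratio for the whole
    trajectory, pi_theta(y | x) / pi_old(y | x), computed from the summed
    per-token log-probabilities."""
    return math.exp(sum(logp_current) - sum(logp_old))
```

In this sketch, `logp_current` and `logp_old` are the per-token log-probabilities of one sampled response under the current and old policies; when the two policies coincide, every token-level ratio and the trajectory-level ratio all equal 1, and they diverge as the policies drift apart between refreshes of the old policy.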