Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all of these components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and report two key findings: (1) incorporating negative feedback is essential, since training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group-relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
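To make the contrast concrete, the following is a minimal sketch of the two surrogate objectives at the level of a single group of sampled responses. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of sequence-level log-probabilities, the standard-deviation normalization, the `eps` and `clip_eps` values, and the omission of any KL term are all assumptions made here for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Responses below the group mean receive negative advantages, which is
    the "negative feedback" signal. (Normalization choice is an assumption.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reinforce_group_loss(logprobs, rewards):
    """REINFORCE-style surrogate with group-relative advantages.

    Hypothetical sketch: no policy ratio and no clipping, only
    -mean(A_i * log pi(o_i | q)) over the group.
    """
    advs = group_relative_advantages(rewards)
    return -mean([a * lp for a, lp in zip(advs, logprobs)])


def clipped_ratio_loss(new_logprobs, old_logprobs, rewards, clip_eps=0.2):
    """PPO-style clipped surrogate with the same advantages, for contrast.

    The KL regularization term is left out of this sketch.
    """
    advs = group_relative_advantages(rewards)
    terms = []
    for a, lp_new, lp_old in zip(advs, new_logprobs, old_logprobs):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * a, clipped * a))
    return -mean(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]      # e.g. correct / incorrect answers
    logprobs = [-1.0, -2.0, -3.0, -1.0]  # hypothetical sequence log-probs
    print(group_relative_advantages(rewards))
    print(reinforce_group_loss(logprobs, rewards))
```

In this sketch the only difference between the two losses is the ratio-and-clip machinery; the advantage computation is shared. In a real training loop the log-probabilities would be differentiable tensors from the policy model rather than plain floats.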