Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that this mechanism is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly improved in both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO outperforms standard GRPO on mathematical reasoning tasks with Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. The implementation is available at https://github.com/chi2liu/ABC-GRPO to facilitate reproducibility.
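The abstract does not spell out the objective, so for orientation the sketch below shows the standard GRPO ingredients it refers to: group-normalized advantages and a PPO-style clipped surrogate. The asymmetric variant with separate lower and upper clipping boundaries (the names `eps_low`/`eps_high` and the static defaults are illustrative assumptions, not the paper's notation) only indicates the kind of boundary change ABC-GRPO's name suggests; the actual adaptive boundary schedule is defined in the paper, not here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within each group of
    completions sampled for the same prompt (shape: [num_groups, group_size])."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def clipped_surrogate(log_probs: torch.Tensor,
                      old_log_probs: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float = 0.2,
                      eps_high: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate as used by standard GRPO.

    With eps_low == eps_high this reduces to the usual symmetric clip to
    [1 - eps, 1 + eps]. An asymmetric, adaptive variant (as ABC-GRPO's
    name suggests) would pass different, possibly scheduled, values for
    the two boundaries; this parameterization is an assumption for
    illustration only.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()  # minimize the negative surrogate
```

In standard GRPO the two boundaries coincide (typically around 0.2); per the abstract, ABC-GRPO relaxes this symmetry and adapts the boundaries during training, which is what the separate `eps_low`/`eps_high` arguments are meant to hint at.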