This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs because it must sample multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions affects model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient computation and updates. Additionally, we introduce a dynamic completion allocation strategy that maximizes GPU utilization by incorporating additional questions, further improving training efficiency. Experimental results demonstrate that CPPO achieves up to an $8.32\times$ speedup on GSM8K and $3.51\times$ on MATH while preserving or even improving accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
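The core pruning step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the function name `cppo_prune` and the `keep_ratio` parameter are hypothetical, and we assume GRPO-style group-relative advantages (rewards standardized within a question's group of completions), after which only the completions with the largest absolute advantage are kept for the gradient update.

```python
import numpy as np

def cppo_prune(rewards, keep_ratio=0.5):
    """Hypothetical sketch of CPPO-style completion pruning.

    Given the rewards of one question's group of completions, compute
    group-relative advantages (as in GRPO) and keep only the completions
    whose absolute advantage is largest.
    """
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style group-relative advantage: standardize within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Retain a fixed fraction of the group, but at least one completion.
    k = max(1, int(len(rewards) * keep_ratio))
    # Indices of the k completions with the largest |advantage|;
    # only these would enter the policy-gradient computation.
    keep = np.argsort(-np.abs(adv))[:k]
    return sorted(keep.tolist()), adv

# Example: 8 sampled completions, two of which got reward 1.0.
idx, adv = cppo_prune([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                      keep_ratio=0.25)
# With keep_ratio=0.25, only the two high-|advantage| completions survive,
# so gradients are computed for 2 completions instead of 8.
```

Pruning by absolute (rather than signed) advantage keeps both strongly positive and strongly negative examples, since both carry a large gradient signal; near-zero-advantage completions contribute little and are the ones dropped.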