Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but it often fails to improve small models because rewards on difficult tasks are sparse. Existing work mitigates this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes that such an oracle already exists, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated with importance-reweighted rollouts generated by the small model, which reduces the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small-model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B on the Minerva dataset, we observe an accuracy gain of more than 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding a speedup of approximately 18%, which may be of independent interest.
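To make the three ingredients named in the abstract concrete (group-relative advantages, an on-policy KD reward for the small model, and importance reweighting of small-model rollouts for the large model), here is a minimal Python sketch. It is not the paper's objective: the abstract does not give the formulas, so the KD reward form (mean per-token log-ratio), the mixing coefficient `beta`, the sequence-level truncated importance ratio, and all function names below are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kd_reward(logp_large, logp_small):
    """ASSUMED form of an on-policy KD reward: mean per-token log-ratio
    log p_large - log p_small on tokens sampled from the small model."""
    return mean([l - s for l, s in zip(logp_large, logp_small)])


def small_model_rewards(task_rewards, logps_large, logps_small, beta=0.1):
    """Task reward plus a dense KD term; beta is a hypothetical mixing weight."""
    return [
        r + beta * kd_reward(l, s)
        for r, l, s in zip(task_rewards, logps_large, logps_small)
    ]


def importance_weight(logp_large, logp_small, clip=2.0):
    """ASSUMED sequence-level ratio p_large(y) / p_small(y), truncated at `clip`,
    used to reweight small-model rollouts when updating the large model."""
    log_ratio = sum(logp_large) - sum(logp_small)
    return min(math.exp(log_ratio), clip)


# One prompt, four rollouts sampled from the small model (made-up numbers).
task_rewards = [0.0, 0.0, 0.0, 0.0]  # every rollout failed: sparse reward
logps_small = [[-0.5, -0.5], [-1.0, -1.0], [-0.2, -0.4], [-0.3, -0.3]]
logps_large = [[-0.4, -0.4], [-1.5, -1.5], [-0.2, -0.4], [-0.1, -0.1]]

# With task rewards alone, every advantage is zero, so there is no gradient signal.
sparse_adv = group_relative_advantages(task_rewards)

# Adding the KD term makes rewards differ across rollouts, so advantages are nonzero.
dense_adv = group_relative_advantages(
    small_model_rewards(task_rewards, logps_large, logps_small)
)

# Per-rollout weights for the large model's update on small-model samples.
weights = [importance_weight(l, s) for l, s in zip(logps_large, logps_small)]
```

The all-zero reward group illustrates the failure mode the abstract attributes to sparse rewards: standard group normalization yields zero advantages, while a dense distillation term restores a learning signal. Truncating the importance ratio is one common way to limit variance; whether CoDistill-GRPO uses truncation, or per-token rather than per-sequence ratios, is not stated in the abstract.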