Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we show theoretically and empirically that late denoising steps yield ill-conditioned off-policy ratios, and we mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
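To make the second and third components more concrete, the following is a minimal sketch of a sequence-level importance ratio combined with a GRPO-style clipped objective and late-step truncation. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the function names, the length-normalized (geometric-mean) form of the ratio, the `keep_steps` truncation parameter, and the `clip_eps` value are all hypothetical and may differ from the actual OP-GRPO method.

```python
import numpy as np


def sequence_level_ratio(logp_new, logp_old, keep_steps):
    """One importance ratio per trajectory (assumed form, not from the paper).

    logp_new, logp_old: arrays of shape (batch, T) holding per-denoising-step
    log-probabilities under the current policy and the behavior policy.
    keep_steps: number of early steps to keep; later steps are truncated.
    """
    diff = logp_new[:, :keep_steps] - logp_old[:, :keep_steps]
    # Geometric mean of the per-step ratios: exp of the mean log-ratio.
    return np.exp(diff.mean(axis=1))


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages, as in GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def clipped_surrogate(ratio, adv, clip_eps=0.2):
    """PPO/GRPO-style clipped surrogate loss (to be minimized)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.minimum(unclipped, clipped).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, T = 4, 10
    logp_old = rng.normal(size=(batch, T))                      # replayed samples
    logp_new = logp_old + 0.05 * rng.normal(size=(batch, T))    # current policy
    rewards = rng.normal(size=batch)

    ratio = sequence_level_ratio(logp_new, logp_old, keep_steps=8)
    loss = clipped_surrogate(ratio, group_advantages(rewards))
    print("ratios:", ratio, "loss:", loss)
```

Under these assumptions, a single ratio per trajectory means the clipping interval applies to the whole sequence rather than to each step separately, and dropping the last denoising steps keeps any large late-step log-ratio differences from entering the ratio.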