Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
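To make the contrastive idea concrete, the following is a minimal, hypothetical sketch (PyTorch) of a loss on intermediate latents that pulls the current trajectory toward the latent of a high-reward rollout while pushing it away from a low-reward one. The abstract does not specify the exact formulation of the TAGRPO objective, so the function name, tensor shapes, margin-based triplet form, and reward-ranking convention below are illustrative assumptions, not the paper's actual loss.

```python
# Sketch only: an assumed contrastive alignment loss on intermediate latents.
# The real TAGRPO loss may differ in form and in how rollouts are ranked.
import torch
import torch.nn.functional as F

def contrastive_latent_loss(latent, pos_latent, neg_latent, margin=0.1):
    """Pull `latent` toward the latent of the highest-reward rollout and push it
    away from the lowest-reward one, triplet-style. Shapes: (B, C, T, H, W)."""
    pos_dist = F.mse_loss(latent, pos_latent)  # distance to high-reward trajectory
    neg_dist = F.mse_loss(latent, neg_latent)  # distance to low-reward trajectory
    return F.relu(pos_dist - neg_dist + margin)

# Toy usage: rollouts generated from identical initial noise, ranked by reward.
B, C, T, H, W = 2, 4, 8, 32, 32
latent = torch.randn(B, C, T, H, W, requires_grad=True)   # current intermediate latent
pos_latent = torch.randn(B, C, T, H, W)                    # from the highest-reward rollout
neg_latent = torch.randn(B, C, T, H, W)                    # from the lowest-reward rollout
loss = contrastive_latent_loss(latent, pos_latent, neg_latent)
loss.backward()
```

In a full pipeline, the positive and negative latents would presumably be drawn from a memory bank of previously generated rollout videos, as the abstract describes, rather than recomputed at every step.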