Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO (Group Relative Policy Optimization) methods face challenges in this setting: few-step ODE and consistency-model samplers deviate from the standard flow-matching ODE, and their short, low-stochasticity trajectories are highly sensitive to the initial noise, rendering intermediate SDE-based exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the contrastive perspective of Neighbor GRPO to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
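To make the chunk-level forking mechanism concrete, the following is a minimal PyTorch-style sketch, assuming hypothetical `policy.generate_chunk` / `policy.generate_chunk_with_logprob` interfaces and a sequence-level `reward_model`; it illustrates the forking, group-relative reward normalization, and localized update described above, not the paper's actual implementation, and it omits the semi-on-policy replay buffer of reference rollouts.

```python
import torch

def ar_copo_step(policy, reward_model, prompt, num_chunks, group_size, optimizer):
    """One illustrative AR-CoPO update: fork at a random chunk, score full
    sequences, and update only the forked chunk with group-relative advantages.
    All model interfaces here are assumed placeholders, not the paper's API."""
    # 1) Shared prefix: stream chunks autoregressively up to a random fork point.
    fork_idx = int(torch.randint(0, num_chunks, (1,)))
    prefix = []
    with torch.no_grad():
        for _ in range(fork_idx):
            prefix.append(policy.generate_chunk(prompt, prefix))

    # 2) Fork: sample a group of neighborhood candidates for the selected chunk,
    #    keeping per-candidate log-probabilities for the policy-gradient update.
    candidates, log_probs = [], []
    for _ in range(group_size):
        chunk, logp = policy.generate_chunk_with_logprob(prompt, prefix)
        candidates.append(chunk)
        log_probs.append(logp)

    # 3) Complete each rollout past the fork and assign a sequence-level reward.
    rewards = []
    for chunk in candidates:
        video = list(prefix) + [chunk]
        with torch.no_grad():
            for _ in range(fork_idx + 1, num_chunks):
                video.append(policy.generate_chunk(prompt, video))
            rewards.append(reward_model(prompt, video))
    rewards = torch.stack(rewards)

    # 4) Group-relative advantages (GRPO-style): normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 5) Localized update: policy gradient flows only through the forked chunk.
    loss = -(adv.detach() * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

In this sketch, restricting gradients to the forked chunk is what makes the GRPO update "localized," while the reward is still computed on the completed sequence, matching the sequence-level reward assignment described in the abstract.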