Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion models, VAR models operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) settings, leading to unstable training and suboptimal alignment. To resolve it, we propose a novel framework that enhances Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward that guides early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), that isolates optimization effects both spatially and temporally. Our approach yields significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization of VAR models.
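To make the credit-assignment idea concrete, one plausible instantiation of a time-step-reweighted GRPO objective is sketched below. This is a minimal sketch under standard GRPO notation, not a formula taken from this work: the weights $w_t$, group size $G$, number of VAR scales $T$, and advantage $\hat{A}_i$ are our assumptions.
\[
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{T} w_t \,\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},
\]
where $r_{i,t}(\theta)=\pi_\theta(a_{i,t}\mid s_{i,t})/\pi_{\theta_{\text{old}}}(a_{i,t}\mid s_{i,t})$ is the per-step importance ratio over the $T$ generation steps, and the dynamic weights $w_t$ redistribute credit across steps rather than averaging them uniformly.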