Flow-GRPO successfully applies reinforcement learning to flow models, but it assigns credit uniformly across all denoising steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency structure). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when their errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit to each step according to the reward improvement it produces. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving the stochasticity required for policy gradients.
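To make the idea of gain-based credit concrete, the following is a minimal sketch rather than the exact formulation of the method. It assumes a rectified-flow convention, x_t = (1 - t) * x_0 + t * noise with velocity v = noise - x_0, so that a one-step clean-sample estimate is x_t - t * v. It further assumes that each step's gain is the difference between consecutive intermediate rewards and that gains are normalized within the sample group at each step. The function names, the normalization, and the toy reward values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def x0_estimate(x_t, v_t, t):
    """One-step clean-sample estimate from a noisy sample.

    Assumes x_t = (1 - t) * x_0 + t * noise and v = noise - x_0,
    which gives x_0 = x_t - t * v. Inputs are flat lists of floats.
    """
    return [x - t * v for x, v in zip(x_t, v_t)]


def stepwise_gain_advantages(rewards, eps=1e-8):
    """Per-step, group-normalized advantages from intermediate rewards.

    rewards[i][k] is the reward of the clean-sample estimate for sample i
    after k denoising steps (k = 0..T). The returned advantages[i][k]
    scores the transition k -> k + 1 (k = 0..T-1).
    Group normalization per step is an assumption of this sketch.
    """
    num_steps = len(rewards[0]) - 1
    # Gain of a step = reward improvement it produces.
    gains = [[r[k + 1] - r[k] for k in range(num_steps)] for r in rewards]
    advantages = [[0.0] * num_steps for _ in rewards]
    for k in range(num_steps):
        column = [g[k] for g in gains]
        mu, sigma = mean(column), pstdev(column)
        for i, g in enumerate(column):
            advantages[i][k] = (g - mu) / (sigma + eps)
    return advantages


# Toy group of two samples with identical final rewards (0.4).
toy_rewards = [
    [0.0, 0.5, 0.4],  # strong first step, slight regression afterwards
    [0.0, 0.1, 0.4],  # weak first step, recovered by the second step
]
toy_advantages = stepwise_gain_advantages(toy_rewards)
```

In this toy group both samples end with the same final reward, so credit based only on the final image would give every step zero advantage. The stepwise gains instead favor the first step of the first sample and the second step of the second sample.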