Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier to directly applying popular policy-gradient methods. Existing approaches primarily craft new objectives on top of already heavily engineered LLM objectives, plugging in ad hoc likelihood estimators without a thorough investigation of how this estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT, without exhibiting reward hacking.
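To make the central idea concrete, below is a minimal, hypothetical sketch of how an ELBO-based likelihood surrogate computed only from the final generated sample could be plugged into a group-normalized policy-gradient loss. It assumes a flow-matching parameterization with a linear interpolation noising path; the function names (`elbo_logp_surrogate`, `group_policy_gradient_loss`), the model call signature `model(x_t, t, cond)`, and the Monte-Carlo sample count `n_mc` are illustrative assumptions, not the paper's exact objective.

```python
import torch


def elbo_logp_surrogate(model, x0, cond, n_mc=4):
    """Monte-Carlo ELBO surrogate for log p_theta(x0 | cond).

    Sketch only: for a flow-matching model with a linear interpolation path,
    -E_{t, eps}[ || v_theta(x_t, t, cond) - (eps - x0) ||^2 ] is, up to
    constants and time-weighting, a lower bound on the model log-likelihood.
    It needs only the final generated sample x0, not the sampling trajectory.
    """
    b = x0.shape[0]
    logp = torch.zeros(b, device=x0.device)
    for _ in range(n_mc):
        t = torch.rand(b, device=x0.device)              # t ~ Uniform(0, 1)
        eps = torch.randn_like(x0)                       # Gaussian noise
        t_ = t.view(b, *([1] * (x0.dim() - 1)))
        x_t = (1.0 - t_) * x0 + t_ * eps                 # linear interpolation between data and noise
        target = eps - x0                                 # flow-matching velocity target
        v = model(x_t, t, cond)                           # predicted velocity (assumed signature)
        logp = logp - ((v - target) ** 2).flatten(1).mean(1) / n_mc
    return logp                                            # higher = more likely under the model


def group_policy_gradient_loss(model, x0, cond, rewards):
    """GRPO-style surrogate loss with group-normalized advantages.

    `x0` holds a group of samples generated for the same prompt `cond`;
    rewards are normalized within the group and weight the ELBO surrogate.
    This illustrates the general recipe, not the paper's exact loss.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    logp = elbo_logp_surrogate(model, x0, cond)
    return -(adv.detach() * logp).mean()
```

The key design point this sketch reflects is that the likelihood term depends only on the final sample `x0`, so the rollout sampler can be chosen independently of the training objective.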