Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable properties such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance of model updates by sampling paired trajectories and pulling the flow velocity toward the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process a single action. We experiment with both high-quality vision-language models and off-the-shelf quality metrics as rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
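To make the mechanism concrete, the sketch below shows one toy update of this kind: two trajectories are sampled for the same prompt, both resulting images are scored by a reward, and the velocity along the losing trajectory is regressed toward a straight-line target pointing at the winning image, so the whole sampling run acts as one action. This is a minimal illustration under our own assumptions; the names (`velocity_model`, `reward_fn`), the Euler sampler, and the straight-line flow-matching-style target are illustrative choices, not the paper's actual implementation.

```python
import torch

def sample_trajectory(velocity_model, prompt_emb, steps=50, dim=8, device="cpu"):
    """Euler integration of the flow ODE from Gaussian noise at t=0 to an image at t=1."""
    x = torch.randn(1, dim, device=device)
    traj = [x]
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        with torch.no_grad():
            v = velocity_model(x, t, prompt_emb)
        x = x + dt * v
        traj.append(x)
    return x, traj

def paired_update(velocity_model, optimizer, prompt_emb, reward_fn, steps=50):
    """One update: sample a pair, score both, pull the velocity toward the winner."""
    img_a, traj_a = sample_trajectory(velocity_model, prompt_emb, steps)
    img_b, traj_b = sample_trajectory(velocity_model, prompt_emb, steps)
    a_wins = reward_fn(img_a) >= reward_fn(img_b)
    win_img = img_a if a_wins else img_b
    lose_traj = traj_b if a_wins else traj_a
    dt = 1.0 / steps
    loss = 0.0
    for i, x_t in enumerate(lose_traj[:-1]):
        t = i * dt
        # Straight-line target pointing at the winning image: one assumed way
        # of "pulling the flow velocity toward the more favorable image".
        target = ((win_img - x_t) / (1.0 - t)).detach()
        v = velocity_model(x_t, torch.full((1,), t), prompt_emb)
        loss = loss + (v - target).pow(2).mean()
    optimizer.zero_grad()
    (loss / steps).backward()
    optimizer.step()
    return float(loss) / steps

if __name__ == "__main__":
    # Tiny runnable demo with a toy MLP velocity field and a toy reward.
    torch.manual_seed(0)
    dim = 8
    net = torch.nn.Sequential(torch.nn.Linear(dim + 1 + dim, 64), torch.nn.SiLU(),
                              torch.nn.Linear(64, dim))
    def velocity_model(x, t, prompt_emb):
        return net(torch.cat([x, t[:, None], prompt_emb], dim=-1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    prompt = torch.randn(1, dim)
    reward_fn = lambda img: -img.pow(2).mean()  # toy reward: prefer low-norm outputs
    for _ in range(3):
        print(paired_update(velocity_model, opt, prompt, reward_fn, steps=20))
```

Because the loss compares whole trajectories against a single scalar reward per image rather than per-step rewards, the pairwise comparison supplies the variance reduction described above: only the relative ordering of the two samples drives the update.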