Recent work adapts online reinforcement learning (RL) techniques from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the target reward, but it faces challenges including low efficiency, dependence on stochastic samplers, and reward hacking. The root cause is that rectified flow models differ fundamentally from LLMs: 1) in terms of efficiency, online image sampling is far more expensive and dominates training time; 2) in terms of stochasticity, rectified flow is deterministic once the initial noise is fixed. Motivated by these problems and inspired by the effectiveness of group-level rewards in LLMs, we design Group-level Direct Reward Optimization (GDRO), a new post-training paradigm for group-level reward alignment that exploits the characteristics of rectified flow models. Through rigorous theoretical analysis, we show that GDRO supports fully offline training, which avoids the large time cost of online image rollout sampling, and is independent of the diffusion sampler, which eliminates the need for an ODE-to-SDE approximation to introduce stochasticity. We further study the reward hacking trap that can mislead evaluation, and incorporate this factor into our evaluation via a corrected score that accounts for both the original evaluation reward and the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization on the OCR and GenEval tasks, while exhibiting strong stability and robustness in mitigating reward hacking.
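As background for the group-level reward idea referenced above (the exact GDRO objective is not specified in this abstract), the following minimal sketch shows GRPO-style group-relative reward normalization over images that share a prompt, as commonly used in LLM RL post-training. The function name, tensor layout, and example rewards are illustrative assumptions, not the paper's implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of images generated from the same prompt.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled image.
    Returns group-relative advantages of the same shape.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, each with a group of 4 offline-sampled images.
rewards = torch.tensor([[0.2, 0.5, 0.9, 0.4],
                        [0.1, 0.1, 0.3, 0.2]])
advantages = group_relative_advantages(rewards)
print(advantages)  # images above their group mean get positive advantage
```

Because the advantages are computed from already-sampled images and their rewards, such a group-level signal is compatible with the fully offline training setting described in the abstract.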