Aligning text-to-image (T2I) diffusion models with human preferences has emerged as a critical research challenge. While recent advances in this area have extended preference optimization techniques from large language models (LLMs) to the diffusion setting, they often struggle with limited exploration. In this work, we propose a novel and orthogonal approach to enhancing diffusion-based preference optimization. First, we introduce a stable reference model update strategy that relaxes the frozen reference model, encouraging exploration while maintaining a stable optimization anchor through reference model regularization. Second, we present a timestep-aware training strategy that mitigates the reward scale imbalance problem across timesteps. Our method can be integrated into various preference optimization algorithms. Experimental results show that our approach improves the performance of state-of-the-art methods on human preference evaluation benchmarks. The code is available on GitHub: https://github.com/kaist-cvml/RethinkingDPO_Diffusion_Models.
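To make the two components concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes a Diffusion-DPO-style pairwise objective on per-sample denoising errors; the EMA-style reference update, the `ema_decay` and `beta` values, and the linear `timestep_weight` function are illustrative assumptions rather than the paper's actual formulation.

```python
# Minimal sketch of (1) a relaxed, slowly updated reference model and
# (2) timestep-aware loss weighting, under a Diffusion-DPO-style pairwise loss.
# All hyperparameters and the specific update/weighting rules are assumptions.
import torch
import torch.nn.functional as F


def update_reference_model(ref_model, policy_model, ema_decay=0.999):
    """Relax the frozen reference: let it slowly track the policy (assumed EMA rule)."""
    with torch.no_grad():
        for p_ref, p_pol in zip(ref_model.parameters(), policy_model.parameters()):
            p_ref.mul_(ema_decay).add_(p_pol, alpha=1.0 - ema_decay)


def timestep_weight(t, num_train_timesteps=1000):
    """Illustrative per-timestep weight intended to counteract reward-scale
    imbalance across timesteps (assumed linear schedule)."""
    return 1.0 - t.float() / num_train_timesteps


def pairwise_preference_loss(policy_err_w, policy_err_l,
                             ref_err_w, ref_err_l, t, beta=0.1):
    """DPO-style loss on denoising errors of preferred (w) / dispreferred (l) samples.

    `*_err_*` are per-sample MSE denoising errors at timesteps `t`; the implicit
    reward difference measures how much more the policy improves over the
    reference on the preferred sample than on the dispreferred one.
    """
    logits = (ref_err_w - policy_err_w) - (ref_err_l - policy_err_l)
    w_t = timestep_weight(t)
    return -(w_t * F.logsigmoid(beta * logits)).mean()
```

In a training loop, `update_reference_model` would be called periodically (e.g., every few optimizer steps) so the reference remains a stable anchor while still moving with the policy; the frequency and decay rate here are placeholders.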