Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. By contrast, human preference learning has not been widely explored for text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model on carefully curated high-quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF that directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 with Diffusion-DPO. In human evaluation, our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model that includes an additional refinement model, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and performs comparably to training on human preferences, opening the door to scaling diffusion model alignment methods.
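To make the idea of a classification objective over preference pairs concrete, the following is a minimal sketch rather than the authors' implementation. It assumes, beyond what the abstract states, that the evidence-lower-bound reformulation reduces to comparing denoising errors of the trained model and a frozen reference model on the preferred and dispreferred images at a shared noise level. The function names, the flat-list noise vectors, and the single scalar `beta` (standing in for any timestep weighting) are all hypothetical.

```python
import math


def squared_error(pred, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(noise_w, noise_l,
                       model_pred_w, model_pred_l,
                       ref_pred_w, ref_pred_l,
                       beta=1.0):
    """Sketch of a DPO-style loss for one (preferred, dispreferred) pair.

    noise_w / noise_l: true noise added to the preferred / dispreferred image.
    model_pred_*:      noise predicted by the model being fine-tuned.
    ref_pred_*:        noise predicted by the frozen reference model.
    beta:              hypothetical scalar for regularization strength.
    """
    # Denoising-error gap (preferred minus dispreferred) for each model.
    model_diff = (squared_error(model_pred_w, noise_w)
                  - squared_error(model_pred_l, noise_l))
    ref_diff = (squared_error(ref_pred_w, noise_w)
                - squared_error(ref_pred_l, noise_l))

    # Positive logit: relative to the reference, the model denoises the
    # preferred image better than the dispreferred one.
    logit = -beta * (model_diff - ref_diff)

    # -log(sigmoid(logit)) equals softplus(-logit).
    return softplus(-logit)


if __name__ == "__main__":
    noise_w, noise_l = [1.0, 0.0], [0.0, 1.0]
    ref_w, ref_l = [0.5, 0.0], [0.0, 0.5]
    # A model identical to the reference gives logit 0, so the loss is log(2).
    print(diffusion_dpo_loss(noise_w, noise_l, ref_w, ref_l, ref_w, ref_l))
```

Under these assumptions, the loss falls below log 2 whenever the fine-tuned model improves on the reference more for the preferred image than for the dispreferred one, which is the behavior a preference classifier should reward.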