Modern alignment techniques based on human preferences, such as RLHF and DPO, typically employ divergence regularization relative to a reference model to ensure training stability. However, this often limits the flexibility of models during alignment, especially when there is a clear distributional discrepancy between the preference data and the reference model. In this paper, we focus on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a significant problem in aligning these models due to the unstructured nature of visual modalities: e.g., a preference for a particular stylistic aspect can easily induce such a discrepancy. Motivated by this observation, we propose a novel and memory-friendly preference alignment method for diffusion models that does not depend on any reference model, coined margin-aware preference optimization (MaPO). MaPO jointly maximizes the likelihood margin between the preferred and dispreferred image sets and the likelihood of the preferred sets, simultaneously learning general stylistic features and preferences. For evaluation, we introduce two new pairwise preference datasets, Pick-Style and Pick-Safety, which comprise self-generated image pairs from SDXL and simulate diverse scenarios of reference mismatch. Our experiments validate that MaPO significantly improves alignment on Pick-Style and Pick-Safety, as well as general preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and other existing methods. Our code, models, and datasets are publicly available at https://mapo-t2i.github.io
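The objective described above combines two terms: a margin term that separates preferred from dispreferred likelihoods, and a term that directly reinforces the preferred likelihood, with no reference-model divergence penalty. As an illustrative sketch only (the function name, coefficients `beta` and `gamma`, and the exact weighting are hypothetical and not taken from the paper), such a reference-free margin objective might look like:

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def margin_aware_loss(logp_preferred: float, logp_dispreferred: float,
                      beta: float = 1.0, gamma: float = 0.1) -> float:
    """Hypothetical sketch of a reference-free, margin-aware preference loss.

    Two terms, matching the high-level description in the abstract:
      1. a margin term pushing the preferred log-likelihood above the
         dispreferred one (via a log-sigmoid of their scaled difference);
      2. a term directly increasing the preferred log-likelihood.
    Note: there is no KL/divergence term against a reference model.
    """
    margin_term = -log_sigmoid(beta * (logp_preferred - logp_dispreferred))
    preferred_term = -gamma * logp_preferred  # minimizing this raises logp_preferred
    return margin_term + preferred_term
```

Minimizing this loss grows both the margin and the preferred likelihood: widening the gap between the two log-likelihoods, or raising the preferred log-likelihood alone, each reduces the loss.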