Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with human preferences without relying on reward models or paired human-annotated data. Specifically, we introduce Semi-Policy Preference Optimization (SePPO). SePPO leverages previous checkpoints as reference models, using them to generate on-policy reference samples that replace the "losing images" in preference pairs. This allows us to optimize using only off-policy "winning images." Furthermore, we design a reference-model selection strategy that expands exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether a reference sample is likely to be a winning or losing image, allowing the model to learn selectively from the generated reference samples. This mitigates the performance degradation caused by uncertainty in reference-sample quality. We validate SePPO on both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released at https://github.com/DwanZhang-AI/SePPO.
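The core idea above — a DPO-style preference loss in which the "loser" is a sample generated by a previous checkpoint, gated by an anchor criterion — can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the scalar log-likelihood inputs, and the specific anchor rule (compare the generated sample's implicit reward against the winner's) are all assumptions for exposition.

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def seppo_style_loss(logp_win_policy, logp_win_ref,
                     logp_gen_policy, logp_gen_ref, beta=0.1):
    """Hypothetical per-sample semi-policy preference loss.

    logp_win_*: log-likelihood of the off-policy "winning" sample under the
                current policy / the reference checkpoint.
    logp_gen_*: log-likelihood of the on-policy sample drawn from the
                reference checkpoint (it stands in for the "losing" sample).
    """
    # Implicit rewards, DPO-style: beta * (policy - reference) log-ratio.
    r_win = beta * (logp_win_policy - logp_win_ref)
    r_gen = beta * (logp_gen_policy - logp_gen_ref)

    # Assumed anchor rule: only push the generated sample down when it
    # looks worse than the anchor (the winning sample); otherwise treat
    # it as an additional positive rather than a negative.
    if r_gen < r_win:
        margin = r_win - r_gen   # standard preference margin: winner over loser
    else:
        margin = r_win + r_gen   # learn from both samples as likely winners
    return -logsigmoid(margin)
```

In practice the inputs would be summed per-step log-likelihoods of the diffusion trajectory, and the loss would be averaged over a batch; the gating keeps a high-quality generated sample from being wrongly penalized as a "loser."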