Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning others, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics, and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.