Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning others, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics, and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.