Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose an enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a reward model is used to rank the examples generated for each prompt. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking form easy pairs, while samples that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, and the batches are introduced gradually, from easy to hard, when training the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on three benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-EE14.
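The pair-construction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `rank_samples` and `curriculum_pairs`, the `num_levels` parameter, the rule that bins rank gaps into equal-width levels, and the reward values are all assumptions made for illustration. The sketch only builds the easy-to-hard preference pairs for one prompt; the DPO update itself is omitted.

```python
from itertools import combinations


def rank_samples(rewards):
    """Return sample indices ordered from highest to lowest reward."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)


def curriculum_pairs(rewards, num_levels):
    """Group (preferred, dispreferred) index pairs into difficulty levels.

    Level 0 holds the easiest pairs (largest rank gap); the last level
    holds the hardest pairs (adjacent ranks). The equal-width binning of
    rank gaps is an illustrative assumption, not taken from the paper.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two samples to form a pair")
    order = rank_samples(rewards)
    max_gap = len(order) - 1
    levels = [[] for _ in range(num_levels)]
    # a < b are positions in the ranking, so order[a] is the preferred sample.
    for a, b in combinations(range(len(order)), 2):
        gap = b - a  # rank difference, between 1 and max_gap
        level = (max_gap - gap) * num_levels // max_gap
        levels[level].append((order[a], order[b]))
    return levels


# Hypothetical reward-model scores for four images generated from one prompt.
rewards = [0.2, 0.9, 0.5, 0.7]
levels = curriculum_pairs(rewards, num_levels=3)
for difficulty, pairs in enumerate(levels):
    print(difficulty, pairs)
```

In a training loop under these assumptions, one would iterate over `levels` in order, so the model sees widely separated pairs before closely ranked ones.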