Aligning text-to-image (T2I) diffusion models with preferences has been attracting increasing research attention. Prior works directly optimize T2I models on preference data, but they are developed under the bandit assumption of a latent reward defined over the entire diffusion reverse chain, and they ignore the sequential nature of the generation process. Prior literature suggests that this may harm both the efficacy and the efficiency of alignment. In this paper, we take a finer-grained, dense-reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into the DPO-style explicit-reward-free loss, to break its temporal symmetry and suit the T2I generation hierarchy. In experiments on single- and multiple-prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further studies are conducted to illustrate the insights behind our approach.
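To make the role of temporal discounting concrete, the following is a minimal sketch of how a discount factor can break the temporal symmetry of a DPO-style loss. It is an illustration under stated assumptions, not the exact objective derived in the paper: the function name `discounted_dpo_loss`, the per-step log-probability inputs, the indexing convention (index 0 is the first step of the reverse chain), and the parameters `beta` and `gamma` are all hypothetical.

```python
import math


def discounted_dpo_loss(logp_w, ref_w, logp_l, ref_l, beta=1.0, gamma=0.9):
    """Illustrative DPO-style loss with temporal discounting (hypothetical).

    Each argument is a sequence of per-step log-probabilities along the
    reverse chain, with index 0 assumed to be the FIRST reverse step:
      logp_w / ref_w: policy / reference model on the preferred trajectory
      logp_l / ref_l: policy / reference model on the dispreferred trajectory
    """
    if not (len(logp_w) == len(ref_w) == len(logp_l) == len(ref_l)):
        raise ValueError("all trajectories must have the same number of steps")

    # Discounted sum of per-step log-ratio differences. With gamma < 1,
    # earlier reverse steps receive larger weights; with gamma = 1, all
    # steps are weighted equally (temporally symmetric).
    margin = 0.0
    for t, (pw, rw, pl, rl) in enumerate(zip(logp_w, ref_w, logp_l, ref_l)):
        margin += (gamma ** t) * ((pw - rw) - (pl - rl))

    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same per-step improvement, placed at the first vs. the last reverse step.
zeros = [0.0, 0.0]
early = discounted_dpo_loss([1.0, 0.0], zeros, zeros, zeros, gamma=0.5)
late = discounted_dpo_loss([0.0, 1.0], zeros, zeros, zeros, gamma=0.5)
print(early < late)
```

In this sketch, with `gamma = 1` the two placements yield the same loss, whereas with `gamma < 1` an improvement at an early step lowers the loss more than the same improvement at a late step, which is the sense in which discounting emphasizes the initial steps of the reverse chain.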