Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment of pre-trained models. Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, but it has not yet been applied to video; we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline through several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore, and we find that re-weighting these pairs according to their scores significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.
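To make the pair-construction idea concrete, here is a minimal sketch of how an OmniScore-style preference pair might be assembled. The exact scoring formula and re-weighting scheme are not specified in this abstract; the weighted sum, best-vs-worst pairing, and score-gap weight below are illustrative assumptions, not the paper's actual method.

```python
def omni_score(visual_quality: float, semantic_alignment: float,
               alpha: float = 0.5) -> float:
    """Fuse the two preference dimensions into one score.

    A weighted sum is one plausible (hypothetical) way to combine
    visual quality and text-video semantic alignment.
    """
    return alpha * visual_quality + (1.0 - alpha) * semantic_alignment


def build_preference_pair(samples):
    """Pick a (preferred, rejected) pair from candidate generations.

    `samples` is a list of (video_id, visual_quality, semantic_alignment)
    tuples for several generations from the same prompt. The highest- and
    lowest-scoring samples form the pair, and the pair is weighted by its
    score gap so that pairs with a clearer preference contribute more
    (an illustrative re-weighting choice).
    """
    scored = sorted(samples, key=lambda s: omni_score(s[1], s[2]))
    rejected, preferred = scored[0], scored[-1]
    weight = omni_score(preferred[1], preferred[2]) - omni_score(rejected[1], rejected[2])
    return preferred[0], rejected[0], weight
```

For example, given three candidate videos for one prompt, the pair builder returns the best and worst candidates along with a gap-based weight that could scale the DPO loss for that pair.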