Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment of pre-trained models. Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, but it has not yet been applied to video; we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline through several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore, and we find that re-weighting these pairs according to their scores significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.
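To make the pair-construction idea concrete, here is a minimal sketch of how an OmniScore-style preference pair might be assembled. The exact scoring formula and re-weighting scheme are not specified in this abstract; the weighted sum, best-vs-worst pairing, and score-gap weight below are illustrative assumptions, not the paper's actual method.

```python
def omni_score(visual_quality: float, semantic_alignment: float,
               alpha: float = 0.5) -> float:
    """Fuse the two preference dimensions into one score.

    A weighted sum is one plausible (hypothetical) way to combine
    visual quality and text-video semantic alignment.
    """
    return alpha * visual_quality + (1.0 - alpha) * semantic_alignment


def build_preference_pair(samples):
    """Pick a (preferred, rejected) pair from candidate generations.

    `samples` is a list of (video_id, visual_quality, semantic_alignment)
    tuples for several generations from the same prompt. The highest- and
    lowest-scoring samples form the pair, and the pair is weighted by its
    score gap so that pairs with a clearer preference contribute more
    (an illustrative re-weighting choice).
    """
    scored = sorted(samples, key=lambda s: omni_score(s[1], s[2]))
    rejected, preferred = scored[0], scored[-1]
    weight = omni_score(preferred[1], preferred[2]) - omni_score(rejected[1], rejected[2])
    return preferred[0], rejected[0], weight
```

For example, given three candidate videos for one prompt, the pair builder returns the best and worst candidates along with a gap-based weight that could scale the DPO loss for that pair.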