Video generation has achieved significant advances through rectified flow techniques, but issues such as unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how the annotations and various design choices affect its effectiveness as a reward model. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models: two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward-weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models and that Flow-DPO outperforms both Flow-RWR and supervised fine-tuning. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
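For concreteness, the unified perspective referenced above can be read as the standard KL-regularized reward-maximization objective common in RLHF; the sketch below uses generic notation ($p_{\theta}$ for the video model being aligned, $p_{\mathrm{ref}}$ for the frozen reference model, $r(x, c)$ for the reward of video $x$ under prompt $c$, and $\beta$ for the regularization strength), which is an assumption and may differ from the paper's exact formulation:

\[
\max_{\theta}\;
\mathbb{E}_{c \sim \mathcal{D},\; x \sim p_{\theta}(x \mid c)}\!\big[\, r(x, c) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, p_{\theta}(x \mid c) \,\big\|\, p_{\mathrm{ref}}(x \mid c) \,\big]
\]

Under this reading, Flow-DPO and Flow-RWR optimize (approximations of) this objective at training time, while Flow-NRG approximates it at inference time by steering sampling with reward guidance on noisy videos.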