Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo, which instructs text-to-video diffusion models with human feedback through reward fine-tuning. InstructVideo has two key ingredients: 1) To reduce the cost of reward fine-tuning incurred by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, which reduces fine-tuning cost and improves fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism that provides reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive qualitative and quantitative experiments validate the practicality and efficacy of using image reward models in InstructVideo, which significantly enhances the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
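To make the idea of scoring a video with an image reward model more concrete, the following is a minimal sketch of segmental sparse sampling combined with a temporally attenuated weighting. It is an illustration under stated assumptions, not the InstructVideo implementation: the abstract does not specify the attenuation scheme, so the exponential decay away from the central sampled frame, the `decay` parameter, and the function names `segmental_frame_indices`, `attenuated_weights`, and `video_reward` are all hypothetical. The `image_reward` callable stands in for an image reward model such as HPSv2 and is a placeholder, not that model's API.

```python
import math
import random
from typing import Callable, List, Sequence


def segmental_frame_indices(num_frames: int, num_segments: int,
                            rng: random.Random) -> List[int]:
    """Split [0, num_frames) into contiguous segments and draw one frame index from each."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Integer segment boundaries; strictly increasing because num_frames >= num_segments.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def attenuated_weights(num_segments: int, decay: float) -> List[float]:
    """Weights that peak at the central sampled frame and decay exponentially
    with distance from it, normalised to sum to 1.
    ASSUMPTION: this particular attenuation form is illustrative only."""
    center = (num_segments - 1) / 2
    raw = [math.exp(-decay * abs(i - center)) for i in range(num_segments)]
    total = sum(raw)
    return [w / total for w in raw]


def video_reward(frames: Sequence, prompt: str,
                 image_reward: Callable[[object, str], float],
                 num_segments: int, decay: float,
                 rng: random.Random) -> float:
    """Aggregate per-frame image rewards over sparsely sampled frames."""
    indices = segmental_frame_indices(len(frames), num_segments, rng)
    weights = attenuated_weights(num_segments, decay)
    return sum(w * image_reward(frames[i], prompt)
               for w, i in zip(weights, indices))


if __name__ == "__main__":
    # Toy stand-ins: frames are integers, and the "reward model" is a dummy function.
    toy_frames = list(range(16))
    dummy_reward = lambda frame, prompt: 1.0 / (1.0 + frame)
    print(video_reward(toy_frames, "a dog running", dummy_reward,
                       num_segments=4, decay=0.5, rng=random.Random(0)))
```

In an actual fine-tuning loop the per-frame rewards would need to be differentiable with respect to the generated frames so that gradients can flow back into the diffusion model; the sketch above only shows how frames might be selected and how their scores might be combined.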