InstructVideo: Instructing Video Diffusion Models with Human Feedback

Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
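
The two ingredients can be illustrated with a minimal sketch, under stated assumptions. The abstract gives no formulas, so the middle-frame sampling rule, the exponential decay weighting, the `decay` parameter, and every function name below are hypothetical illustrations, not the authors' implementation. `image_reward` is a placeholder callable standing in for an image reward model such as HPSv2; it is not that model's actual API. Only the forward-diffusion formula in `corrupt` is the standard one.

```python
import math
import random
from typing import Callable, List, Sequence


def corrupt(x0: Sequence[float], alpha_bar_tau: float, rng: random.Random) -> List[float]:
    """Standard forward diffusion to an intermediate step tau:
    x_tau = sqrt(alpha_bar_tau) * x0 + sqrt(1 - alpha_bar_tau) * eps."""
    a = math.sqrt(alpha_bar_tau)
    b = math.sqrt(1.0 - alpha_bar_tau)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in x0]


def partial_ddim_timesteps(train_steps: int, ddim_steps: int, tau: int) -> List[int]:
    """DDIM timesteps at or below tau, in descending order.
    Starting from a corrupted sample, only these steps are run
    instead of the full chain (assumed uniform stride)."""
    stride = train_steps // ddim_steps
    full_chain = range(0, train_steps, stride)
    return [t for t in reversed(full_chain) if t <= tau]


def segment_indices(num_frames: int, num_segments: int) -> List[int]:
    """Sparse sampling: split the video into equal segments and take
    the middle frame of each (the choice of frame is an assumption)."""
    if not 0 < num_segments <= num_frames:
        raise ValueError("need 0 < num_segments <= num_frames")
    seg_len = num_frames / num_segments
    return [int(seg_len * (k + 0.5)) for k in range(num_segments)]


def attenuation_weights(indices: Sequence[int], num_frames: int, decay: float = 1.0) -> List[float]:
    """Weights that shrink with distance from the central frame and sum to 1.
    The exponential form is an assumption made for this sketch."""
    center = (num_frames - 1) / 2
    raw = [math.exp(-decay * abs(i - center) / num_frames) for i in indices]
    total = sum(raw)
    return [w / total for w in raw]


def video_reward(
    frames: Sequence[object],
    prompt: str,
    image_reward: Callable[[object, str], float],
    num_segments: int = 4,
    decay: float = 1.0,
) -> float:
    """Weighted average of per-frame image rewards over sparsely sampled frames."""
    idx = segment_indices(len(frames), num_segments)
    weights = attenuation_weights(idx, len(frames), decay)
    return sum(w * image_reward(frames[i], prompt) for w, i in zip(weights, idx))
```

With these assumptions, a 1000-step training schedule sampled with 20 DDIM steps and a corruption level of `tau = 300` needs only the steps at or below 300 rather than all 20, which is the source of the cost saving the abstract describes. Scoring a few frames per video keeps the image reward model's cost independent of video length, and down-weighting frames far from the center is one plausible way to avoid pushing every frame toward the same reward-maximizing image.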
