The high computational cost and slow inference of video diffusion models (VDMs) are major obstacles to their deployment in practical applications. To overcome this, we introduce a new VDM compression approach that combines individual-content- and motion-dynamics-preserving pruning with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics}, e.g., the coherence of the entire video, while shallower layers focus more on \textbf{individual content}, e.g., individual frames. We therefore prune redundant blocks from the shallower layers while preserving more of the deeper layers, yielding a lightweight VDM variant called VDMini. Additionally, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss so that VDMini (the student) attains generation performance comparable to the larger VDM (the teacher). Specifically, we first use an Individual Content Distillation (ICD) Loss to enforce consistency between the teacher's and student's features for each generated frame. We then introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics of the generated video as a whole. This approach significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we achieve average speedups of 2.5$\times$ for the I2V method SF-V and 1.4$\times$ for the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.
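The two-part consistency objective can be illustrated schematically. The sketch below is a minimal illustration, not the paper's implementation: the choice of MSE for the ICD term, the non-saturating form of the MCA generator term, the weight `lam`, and all function names are assumptions for exposition only.

```python
import numpy as np

def icd_loss(teacher_feats, student_feats):
    """Individual Content Distillation: per-frame feature consistency
    between teacher and student (MSE is an assumed choice)."""
    return float(np.mean([np.mean((s - t) ** 2)
                          for s, t in zip(student_feats, teacher_feats)]))

def mca_generator_loss(fake_logits):
    """Multi-frame Content Adversarial term seen by the generator,
    written in non-saturating form (assumed); fake_logits are the
    discriminator's scores on whole multi-frame clips."""
    return float(-np.mean(fake_logits))

def icmd_loss(teacher_feats, student_feats, fake_logits, lam=1.0):
    """Combined ICMD consistency loss; the weight `lam` is hypothetical."""
    return icd_loss(teacher_feats, student_feats) + lam * mca_generator_loss(fake_logits)
```

The ICD term operates frame by frame, while the MCA term scores the clip as a whole, matching the abstract's split between individual content (shallow layers) and motion dynamics (deep layers).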