The high computational cost and slow inference of video diffusion models (VDMs) are major obstacles to their deployment in practical applications. To overcome this, we introduce a new VDM compression approach that combines individual-content- and motion-dynamics-preserving pruning with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics}, e.g., the coherence of the entire video, while shallower layers focus more on \textbf{individual content}, e.g., individual frames. We therefore prune redundant blocks from the shallower layers while preserving more of the deeper layers, yielding a lightweight VDM variant called VDMini. Additionally, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss so that VDMini (the student) attains generation performance comparable to the larger VDM (the teacher). Specifically, we first use an Individual Content Distillation (ICD) Loss to enforce consistency between the teacher's and student's features for each generated frame. We then introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics of the generated video as a whole. This approach significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we achieve average speedups of 2.5$\times$ for the I2V method SF-V and 1.4$\times$ for the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.
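The two-part consistency objective can be illustrated schematically. The sketch below is a minimal illustration, not the paper's implementation: the choice of MSE for the ICD term, the non-saturating form of the MCA generator term, the weight `lam`, and all function names are assumptions for exposition only.

```python
import numpy as np

def icd_loss(teacher_feats, student_feats):
    """Individual Content Distillation: per-frame feature consistency
    between teacher and student (MSE is an assumed choice)."""
    return float(np.mean([np.mean((s - t) ** 2)
                          for s, t in zip(student_feats, teacher_feats)]))

def mca_generator_loss(fake_logits):
    """Multi-frame Content Adversarial term seen by the generator,
    written in non-saturating form (assumed); fake_logits are the
    discriminator's scores on whole multi-frame clips."""
    return float(-np.mean(fake_logits))

def icmd_loss(teacher_feats, student_feats, fake_logits, lam=1.0):
    """Combined ICMD consistency loss; the weight `lam` is hypothetical."""
    return icd_loss(teacher_feats, student_feats) + lam * mca_generator_loss(fake_logits)
```

The ICD term operates frame by frame, while the MCA term scores the clip as a whole, matching the abstract's split between individual content (shallow layers) and motion dynamics (deep layers).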