Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we lower memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/