QVGen: Pushing the Limit of Quantized Video Generative Models

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. Quantization, a commonly adopted solution, has achieved notable success in reducing cost for image DMs, but its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) that mitigate large quantization errors, leading to significantly enhanced convergence. To remove the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while reducing the additional inference overhead to zero. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3\text{B}$ to $14\text{B}$, show that QVGen is the first to reach quality comparable to full precision under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench. Code and models are available at https://github.com/ModelTC/QVGen.

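To make the rank-decay idea concrete, the following is a minimal NumPy sketch under stated assumptions, not the released QVGen implementation. It assumes a simple symmetric per-tensor quantizer, treats the auxiliary module as a dense matrix `phi` initialized to the weight quantization error, halves the retained rank in each round, and uses a linear schedule for the decay factor. The names `fake_quantize`, `decay_components`, and `rank_decay_demo` are hypothetical, and the training steps that would normally run between decay steps are omitted.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Symmetric per-tensor fake quantization (an assumed, simple quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def decay_components(phi, keep, gamma):
    """Scale every singular component of phi beyond the top `keep` by gamma."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    weights = np.ones_like(s)
    weights[keep:] = gamma  # gamma = 1 keeps the tail, gamma = 0 removes it
    return (u * (s * weights)) @ vt


def rank_decay_demo(seed=0, dim=32, steps=4):
    """Shrink phi round by round until it vanishes; return phi and the errors."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((dim, dim))
    w_q = fake_quantize(w)
    phi = w - w_q  # assumption: phi starts as the exact quantization error
    keep = dim
    errors = []
    while keep > 0:
        keep //= 2  # assumption: halve the retained rank each round
        decayed = phi
        for gamma in np.linspace(1.0, 0.0, steps):
            # In real QAT, optimizer steps would run here so the quantized
            # weights can absorb what the decaying components contributed.
            decayed = decay_components(phi, keep, gamma)
        phi = decayed  # gamma reached 0, so the low-contributing tail is gone
        errors.append(np.linalg.norm(w - (w_q + phi)))
    return phi, errors


if __name__ == "__main__":
    phi, errors = rank_decay_demo()
    print("final phi is zero:", np.allclose(phi, 0.0))
    print("reconstruction error per round:", np.round(errors, 3))
```

Because no training happens in this sketch, the reconstruction error can only grow as components are removed; the abstract's claim is that, with training interleaved, performance is retained while `phi` is eliminated and adds no inference cost.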