Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, which drastically reduces computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly its efficiency gains over existing Video VAEs: our model requires up to 50x fewer FLOPs and delivers 44x faster inference while maintaining competitive reconstruction quality, offering a path toward scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE.