SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

From arXiv: 21 pages, 15 figures, 7 tables

We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 720x1280 resolution and up to a minute in length. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at high speed, and it can be deployed on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: We use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation that uses a constant-memory state derived from the cumulative properties of linear attention. This KV cache gives the Linear DiT global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
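
The constant-memory state described in design (2) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the ReLU feature map, the `eps` stabilizer, and the function names `feature_map`, `block_linear_attention`, and `reference_block_causal` are all assumptions made for illustration. The sketch shows only the general property the abstract relies on, namely that linear attention lets earlier blocks be summarized by two running sums whose size does not depend on sequence length.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel feature map (ReLU); it keeps attention scores non-negative.
    return np.maximum(x, 0.0)

def block_linear_attention(q, k, v, block_size, eps=1e-6):
    """Block-wise autoregressive linear attention with a fixed-size state.

    Tokens in a block attend to every token in the same block and in all
    earlier blocks. Earlier blocks are represented only by `state` (d x d_v)
    and `norm` (d), so memory does not grow with the number of blocks.
    """
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))  # running sum of phi(k)^T v
    norm = np.zeros(d)          # running sum of phi(k)
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb = feature_map(q[start:end])
        kb = feature_map(k[start:end])
        vb = v[start:end]
        # Fold the current block into the cumulative state.
        state += kb.T @ vb
        norm += kb.sum(axis=0)
        # Each query reads the shared state; no per-token keys/values are kept.
        out[start:end] = (qb @ state) / (qb @ norm + eps)[:, None]
    return out, state, norm

def reference_block_causal(q, k, v, block_size, eps=1e-6):
    # Quadratic-cost reference: explicit score matrix with a block-causal mask.
    n = q.shape[0]
    scores = feature_map(q) @ feature_map(k).T
    block_id = np.arange(n) // block_size
    mask = block_id[:, None] >= block_id[None, :]
    scores = scores * mask
    return (scores @ v) / (scores.sum(axis=1) + eps)[:, None]
```

Under these assumptions, the block-wise version reproduces the masked quadratic computation while storing only a `d x d_v` matrix and a length-`d` vector between blocks, which is the sense in which the cache has a fixed memory cost.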
