We introduce SANA-Video, a small diffusion model that efficiently generates videos at up to 720x1280 resolution and up to a minute in length. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at high speed and can be deployed on an RTX 5090 GPU. Two core designs make this efficient long-video generation possible: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that maintains a constant-memory state derived from the cumulative property of linear attention. This KV cache gives the Linear DiT global context at a fixed memory cost, removing the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
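The constant-memory state behind design (2) can be illustrated with a minimal sketch. It is not the SANA-Video implementation: the abstract does not specify the feature map, the normalization, or how tokens inside a block attend to each other, so the ReLU-based `phi`, the choice to fold the current block into the state before reading, and all names (`LinearAttentionState`, `blockwise_linear_attention`, `full_history_attention`, `random_blocks`) are assumptions made for illustration. The sketch only shows why the cumulative sums of linear attention let a fixed-size state stand in for a cache that keeps every past key and value.

```python
import random

EPS = 1e-6  # assumption: small constant keeping the normalizer strictly positive


def phi(x):
    """Assumed feature map: ReLU plus a small constant (not taken from the paper)."""
    return [max(v, 0.0) + EPS for v in x]


class LinearAttentionState:
    """Running sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).

    The state has shape d_k x d_v plus d_k, regardless of how many tokens were seen.
    """

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def update(self, keys, values):
        # Fold each (key, value) pair into the cumulative sums.
        for k, v in zip(keys, values):
            for i, fki in enumerate(phi(k)):
                self.z[i] += fki
                for j, vj in enumerate(v):
                    self.S[i][j] += fki * vj

    def read(self, query):
        # Output = phi(q)^T S / (phi(q)^T z).
        fq = phi(query)
        denom = sum(a * b for a, b in zip(fq, self.z))
        d_v = len(self.S[0])
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(d_v)]


def blockwise_linear_attention(blocks, d_k, d_v):
    """Process (queries, keys, values) blocks in order using only the fixed-size state."""
    state = LinearAttentionState(d_k, d_v)
    outputs = []
    for q, k, v in blocks:
        state.update(k, v)  # current block joins the context (assumed design)
        outputs.append([state.read(qi) for qi in q])
    return outputs, state


def full_history_attention(blocks):
    """Reference that stores every past key and value, so memory grows with length."""
    past_k, past_v, outputs = [], [], []
    for q, k, v in blocks:
        past_k += [phi(ki) for ki in k]
        past_v += v
        d_v = len(past_v[0])
        block_out = []
        for qi in q:
            fq = phi(qi)
            w = [sum(a * b for a, b in zip(fq, fk)) for fk in past_k]
            total = sum(w)
            block_out.append([sum(wj * vj[c] for wj, vj in zip(w, past_v)) / total
                              for c in range(d_v)])
        outputs.append(block_out)
    return outputs


def random_blocks(n_blocks, block_len, d_k, d_v, seed=0):
    """Hypothetical toy data: Gaussian queries, keys, and values per block."""
    rng = random.Random(seed)

    def vec(d):
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    return [([vec(d_k) for _ in range(block_len)],
             [vec(d_k) for _ in range(block_len)],
             [vec(d_v) for _ in range(block_len)]) for _ in range(n_blocks)]
```

Calling `blockwise_linear_attention(random_blocks(4, 3, 5, 2), 5, 2)` should match `full_history_attention` on the same blocks up to floating-point error, while the state stays at `d_k * d_v + d_k` numbers no matter how many blocks are processed. A softmax-attention cache has no such summary, because its normalization depends on each query paired with every stored key.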