We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.
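To make the memory argument behind the linear-attention branch concrete, the following is a minimal single-head sketch of the gated delta rule as it is commonly written for Gated DeltaNet: $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$, with output $o_t = S_t q_t$. It is an illustration under stated assumptions, not the SANA-WM implementation: the function names `gated_delta_step` and `run_gdn` are hypothetical, the gates are passed in as plain scalars, keys are assumed unit-norm, and the frame-wise grouping and the interleaved softmax-attention layers described in the abstract are not modeled.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token step of the gated delta rule (hypothetical helper).

    S is a d_v x d_k state stored as a list of rows; q and k have length d_k,
    v has length d_v. alpha in [0, 1] decays old memory, beta in [0, 1] is the
    write strength. Returns the updated state and the read-out S_new @ q.
    """
    S_new = []
    for row, v_i in zip(S, v):
        # What the memory currently returns for key k in this value channel.
        pred = sum(s * k_j for s, k_j in zip(row, k))
        # Erase the old association along k, decay the rest, write the new value.
        S_new.append([alpha * (s - beta * pred * k_j) + beta * v_i * k_j
                      for s, k_j in zip(row, k)])
    out = [sum(s * q_j for s, q_j in zip(row, q)) for row in S_new]
    return S_new, out


def run_gdn(qs, ks, vs, alphas, betas):
    """Scan a sequence; the state size is d_v x d_k regardless of length."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outs.append(o)
    return S, outs


# Toy sequence: write under key k1, write under an orthogonal key k2 with
# decay 0.5, then overwrite the value stored under k1. Every query reads k1.
k1, k2 = [1.0, 0.0], [0.0, 1.0]
S_final, outs = run_gdn(
    qs=[k1, k1, k1],
    ks=[k1, k2, k1],
    vs=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    alphas=[1.0, 0.5, 1.0],
    betas=[1.0, 1.0, 1.0],
)
print(outs)  # [[1.0, 2.0], [0.5, 1.0], [5.0, 6.0]]
```

The second read returns the first value scaled by the decay gate, and the third read returns the overwritten value exactly. The state stays a fixed $d_v \times d_k$ matrix however long the sequence is, which is the property that makes this kind of recurrence attractive for minute-scale context, whereas a softmax-attention cache grows with sequence length.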
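The efficiency figures quoted in the abstract can be put on a common footing with a little arithmetic. The sketch below uses only the numbers stated above (15 days on 64 H100s; 34s to denoise a 60s clip on one RTX 5090); the helper names `gpu_hours` and `realtime_factor` are hypothetical. The 34s figure is described as denoising time, so it may not cover other stages such as decoding, and the resulting ratio is best read as an upper bound on end-to-end speed.

```python
def gpu_hours(days, num_gpus):
    """Total accelerator-hours for a run of `days` on `num_gpus` devices."""
    return days * 24 * num_gpus


def realtime_factor(clip_seconds, wall_seconds):
    """Seconds of video produced per second of wall-clock time."""
    return clip_seconds / wall_seconds


train_cost = gpu_hours(days=15, num_gpus=64)
speed = realtime_factor(clip_seconds=60, wall_seconds=34)

print(train_cost)       # 23040 H100-hours
print(round(speed, 2))  # 1.76, i.e. denoising runs faster than real time
```

The $36\times$ throughput claim is relative to baselines whose absolute numbers are not given in the abstract, so it is not reproduced here.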