Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite their strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse: it combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, a structure that exceeds the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of this parameterization through finetuning and custom Triton kernels. We first validate that Monarch-RT substantially outperforms existing sparse baselines designed only for bidirectional models. We then show that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT, to our knowledge, the first highly capable sparse-attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on NVIDIA RTX 5090, H100, and B200 GPUs respectively, with kernel speedups of 1.4-11.8×. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing, at 16 FPS on a single RTX 5090.
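To make the core primitive concrete: a minimal NumPy sketch of the standard square Monarch form M = Pᵀ B₂ P B₁ (two block-diagonal factors interleaved with a perfect-shuffle permutation), not the paper's tiled variant or its Triton kernels; the function names and the verification helper are ours, for illustration only.

```python
import numpy as np

def monarch_apply(x, B1, B2):
    """Multiply x by the Monarch matrix M = P^T @ diag(B2) @ P @ diag(B1),
    where diag(.) is block-diagonal with m blocks of size m x m and P is the
    perfect-shuffle (grid-transpose) permutation. Costs O(n^1.5) multiplies
    for n = m*m, versus O(n^2) for a dense matrix-vector product."""
    m = B1.shape[0]
    y = x.reshape(m, m)                 # split x into m contiguous blocks
    y = np.einsum('bij,bj->bi', B1, y)  # block-diagonal multiply by B1
    y = y.T                             # perfect shuffle P (grid transpose)
    y = np.einsum('bij,bj->bi', B2, y)  # block-diagonal multiply by B2
    return y.T.reshape(-1)              # undo the shuffle (P^T) and flatten

def monarch_dense(B1, B2):
    """Materialize M explicitly (for checking only; defeats the savings)."""
    m = B1.shape[0]
    n = m * m
    def blkdiag(B):
        D = np.zeros((n, n))
        for b in range(m):
            D[b * m:(b + 1) * m, b * m:(b + 1) * m] = B[b]
        return D
    perm = np.arange(n).reshape(m, m).T.reshape(-1)
    P = np.eye(n)[perm]                 # (P @ v)[k] = v[perm[k]]
    return P.T @ blkdiag(B2) @ P @ blkdiag(B1)
```

The appeal of this structure for the regime described above is that, unlike top-k sparsity, every output coordinate can mix information from every input through just two sparse factors, so dense mixing and periodic block structure are representable at sub-quadratic cost.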