Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we find that applying them to AR models causes considerable performance degradation for two reasons: each chunk's generation is considered in isolation, and informative past context is underused. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism that quantitatively estimates the contribution of each chunk and allocates sparsity to each chunk accordingly. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame level and block level) adaptively handles diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., $1.2{\sim}1.3\times$ end-to-end speedup). Combined with other efficient solutions, \textsc{Light Forcing} further achieves a $2.0{\sim}3.0\times$ end-to-end speedup across diverse GPUs (e.g., 27.4\,FPS on RTX 5090 and 33.9\,FPS on H100). Code is available at this \href{https://github.com/chengtao-lv/LightForcing}{link}.
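
To make the two ideas above more concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a linear schedule as a stand-in for the chunk-level sparsity allocation (the abstract does not specify how chunk contribution is estimated) and assumes mean-pooled block scores as the frame-level criterion. All function names, parameters, and score values here are hypothetical.

```python
from typing import List, Set, Tuple


def sparsity_schedule(num_chunks: int, start: float = 0.0, end: float = 0.8) -> List[float]:
    """Hypothetical linear schedule: later chunks receive a higher sparsity ratio.

    This is only a placeholder for a contribution-based allocation.
    """
    if num_chunks == 1:
        return [start]
    step = (end - start) / (num_chunks - 1)
    return [start + i * step for i in range(num_chunks)]


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def hierarchical_select(
    block_scores: List[List[float]], keep_frames: int, keep_blocks: int
) -> Set[Tuple[int, int]]:
    """Coarse-to-fine selection: choose frames first, then blocks inside them.

    block_scores[f][b] is an assumed relevance score of key block b in
    past frame f with respect to the current query block.
    """
    # Frame level (coarse): pool block scores into one score per frame.
    frame_scores = [sum(row) / len(row) for row in block_scores]
    kept: Set[Tuple[int, int]] = set()
    for f in top_k_indices(frame_scores, keep_frames):
        # Block level (fine): keep the strongest blocks of each selected frame.
        for b in top_k_indices(block_scores[f], keep_blocks):
            kept.add((f, b))
    return kept


# Toy example: 3 past frames with 3 key blocks each (made-up scores).
example_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.1, 0.1],
    [0.4, 0.8, 0.7],
]
example_mask = hierarchical_select(example_scores, keep_frames=2, keep_blocks=2)
example_schedule = sparsity_schedule(5)
```

In this toy example, frames 0 and 2 have the highest pooled scores, so only their two strongest blocks are attended to, and frame 1 is skipped entirely. A real kernel would turn the selected `(frame, block)` pairs into a block-sparse attention mask rather than a Python set.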
