Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we find that applying them directly to AR models leads to considerable performance degradation, for two reasons: the isolated treatment of chunk generation and the insufficient utilization of informative past context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism that quantitatively estimates the contribution of each chunk and allocates sparsity accordingly. This progressive sparsity-increase strategy enables the current chunk to inherit knowledge from earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} mechanism that captures informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (\ie, at the frame and block levels) adaptively handles diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention approaches in both quality (\eg, 84.5 on VBench) and efficiency (\eg, a $1.2{\sim}1.3\times$ end-to-end speedup). Combined with FP8 quantization and LightVAE, \textsc{Light Forcing} further achieves a $2.3\times$ speedup and 19.7\,FPS on an RTX~5090 GPU. Code will be released at \href{https://github.com/chengtao-lv/LightForcing}{https://github.com/chengtao-lv/LightForcing}.