Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, SLA has two limitations: (i) it relies on a heuristic split that assigns each computation to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal; and (ii) a formal analysis of the attention error in SLA reveals a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 achieves 97% attention sparsity and delivers an 18.6x attention speedup while preserving generation quality.
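To make the described combination concrete, the following is a minimal, hypothetical sketch of mixing a sparse branch (top-k masking as a stand-in for the paper's sparsity pattern) and a linear-attention branch (the common `elu(x) + 1` feature map) with a learnable per-head ratio. All names, the feature map, and the top-k masking are illustrative assumptions, not the actual SLA2 kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinearAttentionSketch(nn.Module):
    """Illustrative sketch: learnable-ratio mix of sparse and linear attention."""

    def __init__(self, dim, num_heads=4, keep_ratio=0.03):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.keep_ratio = keep_ratio  # e.g. keep ~3% of scores -> ~97% sparsity
        # Learnable mixing ratio, one scalar per head (assumed parameterization).
        self.mix = nn.Parameter(torch.zeros(num_heads))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim)
        scale = self.head_dim ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale

        # Sparse branch: keep only the top-k scores per query (a simple
        # stand-in for a structured sparsity mask).
        seq = scores.shape[-1]
        k_keep = max(1, int(seq * self.keep_ratio))
        topk = scores.topk(k_keep, dim=-1).indices
        mask = torch.full_like(scores, float("-inf"))
        mask.scatter_(-1, topk, 0.0)
        sparse_out = F.softmax(scores + mask, dim=-1) @ v

        # Linear branch: kernel feature map phi(x) = elu(x) + 1 gives an
        # O(n) formulation (the paper's actual kernel may differ).
        phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
        kv = phi_k.transpose(-2, -1) @ v                     # (b, h, d, d)
        z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
        linear_out = (phi_q @ kv) / (z + 1e-6)

        # Learnable ratio combines the two branches per head.
        r = torch.sigmoid(self.mix).view(1, -1, 1, 1)
        return r * sparse_out + (1 - r) * linear_out
```

Note that the speedup in the actual method comes from never materializing the masked-out scores in a fused kernel; the dense mask above only demonstrates the mathematical decomposition.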