Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences incur high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to mitigate this cost. Training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72x inference speedup while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps at a batch size of 8.
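The parallel-branch design above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the top-k masking standing in for the sparse attention pattern, the elu+1 feature map for the linear branch, and the per-token sigmoid gate `w_gate` are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, keep=0.1):
    # Stand-in for the sparse branch: full scores with per-query top-k
    # masking, so only a `keep` fraction of key positions survive.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    kth = max(1, int(keep * k.shape[0]))
    thresh = np.sort(scores, axis=-1)[:, -kth][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)
    return softmax(scores) @ v

def linear_attention(q, k, v):
    # Kernelized attention with phi(x) = elu(x) + 1; cost is O(N d^2)
    # rather than O(N^2 d), which is why the branch is lightweight.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                     # (d, d_v) summary of keys/values
    z = qp @ kp.sum(axis=0)           # per-query normalizer, always > 0
    return (qp @ kv) / z[:, None]

def gated_combine(x, q, k, v, w_gate, keep=0.1):
    # Input-dependent gate g in (0, 1) blends the two branches per token.
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # (N, 1)
    return g * sparse_attention(q, k, v, keep) + (1.0 - g) * linear_attention(q, k, v)
```

Under this sketch, the gate lets each token lean on the sparse branch where a few keys dominate and on the linear branch where attention mass is spread out, which is one plausible reading of how the two branches complement each other.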