Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, which makes long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery yields the first competitive sub-quadratic-attention video diffusion models, reducing attention cost by up to 40\% in FLOPs while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks. The project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
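To make the idea of token-wise mixing of softmax and linear attention concrete, the following is a minimal, hypothetical PyTorch sketch. It is not the paper's actual formulation: the split point `num_softmax_tokens`, the ELU+1 feature map, and the uniform averaging of the two branches are all illustrative assumptions, whereas the paper describes a distilled, cost-aware hybrid design.

```python
# Hypothetical sketch of a token-wise hybrid of softmax and linear attention.
# Exact softmax attention is applied over a small set of key/value tokens,
# and kernelized linear attention over the rest; the two outputs are averaged.
import torch
import torch.nn.functional as F


def hybrid_attention(q, k, v, num_softmax_tokens):
    """q, k, v: (batch, seq_len, dim). The first `num_softmax_tokens`
    keys/values use exact softmax attention; the remainder use linear
    (ELU+1 feature map) attention."""
    b, n, d = q.shape
    m = num_softmax_tokens

    # Quadratic softmax attention over the first m key/value tokens.
    k_s, v_s = k[:, :m], v[:, :m]
    attn = torch.softmax(q @ k_s.transpose(-2, -1) / d**0.5, dim=-1)
    out_softmax = attn @ v_s                                   # (b, n, d)

    # Linear attention over the remaining tokens: O(n * d^2) instead of O(n^2 * d).
    k_l, v_l = k[:, m:], v[:, m:]
    q_f = F.elu(q) + 1
    k_f = F.elu(k_l) + 1
    kv = k_f.transpose(-2, -1) @ v_l                           # (b, d, d)
    z = q_f @ k_f.sum(dim=1, keepdim=True).transpose(-2, -1)   # (b, n, 1)
    out_linear = (q_f @ kv) / (z + 1e-6)

    # Naive 50/50 combination; a learned gate or distillation-tuned mixing
    # would be closer to a practical design.
    return 0.5 * (out_softmax + out_linear)


if __name__ == "__main__":
    q = torch.randn(2, 128, 64)
    k = torch.randn(2, 128, 64)
    v = torch.randn(2, 128, 64)
    print(hybrid_attention(q, k, v, num_softmax_tokens=16).shape)  # (2, 128, 64)
```

Because only the softmax branch scales quadratically with sequence length, keeping `num_softmax_tokens` small while routing most tokens through the linear branch is what produces the sub-quadratic attention cost the abstract refers to.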