Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computational cost scales quadratically with sequence length, owing to the quadratic complexity of self-attention. While linear attention lowers this cost, fully replacing quadratic attention requires expensive pretraining, due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of relying on manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and that our 4-step distilled model further delivers a 15.92x latency reduction with minimal loss in visual quality.
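To make the complexity gap concrete, the sketch below contrasts standard softmax attention, which materializes an N x N score matrix, with a generic linear-attention variant that reorders the matrix products. This is a minimal illustration assuming the common elu-plus-one feature map (Katharopoulos et al., 2020); it is not LinVideo's implementation, and all function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: the (B, N, N) score matrix makes both compute
    # and memory quadratic in the sequence length N.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, N, N)
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic linear attention with feature map phi(x) = elu(x) + 1.
    # Computing (phi(K)^T V) first yields a (d, d) state, so the cost is
    # O(N * d^2) instead of O(N^2 * d) and no N x N matrix is ever formed.
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                              # (B, d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)     # (B, N, 1)
    return (q @ kv) / (z + eps)

# Usage: identical interface, different asymptotic cost.
q = k = v = torch.randn(1, 4096, 64)
out_quad = softmax_attention(q, k, v)  # materializes a 4096 x 4096 matrix
out_lin = linear_attention(q, k, v)    # keeps only a 64 x 64 state
```

The key point is associativity: evaluating phi(K)^T V before multiplying by phi(Q) trades the N x N score matrix for a d x d state, which is why linear attention is attractive for the long token sequences produced by video latents.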