Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
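The core mechanism described above can be illustrated with a minimal sketch: attention where a precomputed block-level boolean mask (the output of the calibration pass) selects which query-block/key-block pairs are computed densely, while the rest are skipped entirely. This is a simplified single-head NumPy illustration, not the paper's implementation; the function name `block_sparse_attention`, the block size, and the mask layout are assumptions for exposition.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Single-head attention that only computes key blocks flagged in
    block_mask (a calibrated, input-independent pattern in the paper's
    setting). Q, K, V: (T, d); block_mask: (T//block, T//block) boolean,
    where block_mask[i, j] = True means query block i attends to key block j.
    """
    T, d = Q.shape
    nb = T // block
    out = np.zeros_like(Q)
    for qi in range(nb):
        q = Q[qi * block:(qi + 1) * block]            # (block, d) query tile
        keep = np.where(block_mask[qi])[0]            # selected key blocks
        if keep.size == 0:
            continue                                  # all blocks skipped
        # Gather only the selected key/value rows; skipped blocks are
        # never touched, which is where the speedup comes from.
        idx = np.concatenate(
            [np.arange(k * block, (k + 1) * block) for k in keep])
        scores = q @ K[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)             # softmax over kept keys
        out[qi * block:(qi + 1) * block] = w @ V[idx]
    return out
```

With an all-True mask this reduces to dense attention; in practice the calibration pass would supply a distinct mask per layer, head, and diffusion timestep, and a fused kernel would replace the Python loop.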