Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially for high-definition, lengthy videos, where attention can consume up to 95% of processing time and demands specialized context parallelism. This paper introduces DSV, which accelerates video DiT training by exploiting the dynamic attention sparsity we empirically observe. DSV uses a two-stage algorithm that captures dynamic sparsity patterns via low-rank approximations of the original queries and keys, and employs custom kernels to efficiently identify critical key-value pairs and compute the sparse attention. To accommodate this new sparsity dimension, DSV adopts a hybrid sparsity-aware context parallelism that re-balances the workload across attention heads and blocks, which otherwise becomes skewed due to sparsity heterogeneity. DSV achieves up to 3.02x higher training throughput, scaling to 128 GPUs and sequence lengths of 520k tokens, without quality loss.
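To make the two-stage idea concrete, here is a minimal NumPy sketch (not DSV's actual kernels): a cheap low-rank projection of the queries and keys predicts which key-value pairs matter for each query, and exact attention is then computed only over that critical subset. The projection matrix `P`, the rank `r`, and the top-`k` budget are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, r, k = 64, 32, 4, 8  # sequence length, head dim, low rank, top-k budget

Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

# Hypothetical low-rank projection (learned in practice; random here).
P = rng.standard_normal((d, r)) / np.sqrt(d)
Qr, Kr = Q @ P, K @ P  # rank-r sketches of queries and keys

# Stage 1: approximate scores from the sketches predict critical keys
# per query, at O(seq^2 * r) cost instead of O(seq^2 * d).
approx = Qr @ Kr.T
topk = np.argsort(-approx, axis=1)[:, :k]  # predicted-critical key indices

# Stage 2: exact attention restricted to the selected key-value pairs.
out = np.empty_like(Q)
for i in range(seq):
    idx = topk[i]
    s = Q[i] @ K[idx].T / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    out[i] = w @ V[idx]
```

Each query now attends to only `k` of `seq` keys; the efficiency gain comes from replacing the dense score matrix in the exact pass with gathered, per-query sparse computation, which is what DSV's custom kernels implement efficiently on GPU.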