FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines the Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies, including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on the FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves a consistent \textbf{1.12$\times$--1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves a \textbf{1.09$\times$} speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile-level optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of its 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.
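To make the operator-level idea concrete, the following is a minimal timing sketch of why double buffering helps pipelined Ring attention. It is not the FastUSP implementation: the function name `ring_step_times`, the per-block cost model, and all numbers are hypothetical and chosen only for illustration. It assumes every ring step has the same compute cost and the same key/value transfer cost, and that with double buffering the transfer of the next block fully overlaps with attention on the current block.

```python
def ring_step_times(compute_ms: float, comm_ms: float, steps: int) -> tuple[float, float]:
    """Return (sequential_ms, pipelined_ms) for a ring of `steps` KV blocks.

    Hypothetical cost model, not measured data:
    - each block needs `compute_ms` of attention compute;
    - each of the `steps - 1` hand-offs needs `comm_ms` of transfer time.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")

    # Without overlap, every transfer sits on the critical path.
    sequential = steps * compute_ms + (steps - 1) * comm_ms

    # With double buffering, block i+1 is received into a second buffer
    # while block i is being processed, so each of the first `steps - 1`
    # steps costs max(compute, comm); the final step is compute only.
    pipelined = (steps - 1) * max(compute_ms, comm_ms) + compute_ms

    return sequential, pipelined


if __name__ == "__main__":
    # Illustrative values only: a compute-bound case and a communication-bound case.
    for compute_ms, comm_ms in [(4.0, 1.0), (1.0, 4.0)]:
        seq, pipe = ring_step_times(compute_ms, comm_ms, steps=4)
        print(f"compute={compute_ms} ms, comm={comm_ms} ms: "
              f"sequential={seq} ms, pipelined={pipe} ms")
```

Under this model, overlap removes at most the smaller of the compute and transfer costs at each hand-off. When transfers are already short relative to compute, the gain from pipelining is correspondingly small, which is consistent with the abstract's observation that communication latency is not the primary bottleneck on high-bandwidth interconnects.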
