Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).
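For readers unfamiliar with the all-to-all step that Ulysses-style sequence parallelism relies on, the following is a minimal single-process sketch of the data re-partitioning it performs. It is not StreamFusion's implementation and uses no real communication library. The function names, the `(token, head)` labels standing in for activation data, and the sizes are all hypothetical and chosen purely for illustration. The sketch assumes the token count and head count are both divisible by the number of ranks.

```python
# Single-process simulation of the all-to-all re-partitioning used by
# Ulysses-style sequence parallelism. All names here are illustrative
# assumptions; each (token, head) tuple stands in for one activation slice.

def shard_by_sequence(num_tokens, num_heads, world_size):
    """Rank r starts with a contiguous slice of tokens and all heads."""
    per_rank = num_tokens // world_size
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * per_rank, (r + 1) * per_rank)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: every rank sends head group d to rank d.

    Afterwards each rank holds every token, but only its own head group,
    so attention over the full sequence can be computed locally per head.
    """
    world_size = len(shards)
    heads_per_rank = num_heads // world_size
    out = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        received = []
        for src in range(world_size):  # chunks are concatenated in rank order
            for token_row in shards[src]:
                received.append(token_row[lo:hi])
        out.append(received)
    return out


NUM_TOKENS, NUM_HEADS, WORLD_SIZE = 8, 4, 2
seq_shards = shard_by_sequence(NUM_TOKENS, NUM_HEADS, WORLD_SIZE)
head_shards = all_to_all_seq_to_head(seq_shards, NUM_HEADS)
```

In this toy setup every rank exchanges a chunk with every other rank. When ranks sit on different machines, that exchange crosses the slower inter-machine links, which is the cost the abstract's second limitation refers to.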