Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).