Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, but growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference, yet it introduces frequent all-to-all collectives that dominate latency. Overlapping these collectives with computation is difficult because of tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine that exploits two observations: (1) the value tensor V requires only a linear projection, whereas the query and key tensors Q/K need additional normalization and RoPE, which creates an opportunity to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA), which decomposes collectives into topology-aligned phases; V-First scheduling, which hides V's communication behind Q/K computation; and V-Major selective communication, which transmits only active projections on slow interconnects. On the Aurora supercomputer, evaluated with four DiT models on 1 to 8 nodes (up to 96 Intel GPU tiles), CoCoDiff achieves an average speedup of 3.6x, peaking at 8.4x.
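The V-First idea described above (V is ready after a single linear projection, so its communication can start while Q/K are still being normalized and rotated) can be illustrated with a minimal, CPU-only sketch. This is not CoCoDiff's implementation: the function names (`linear`, `rms_norm`, `rope`, `fake_all_to_all`, `v_first_step`) are hypothetical, the collective is replaced by a sleeping stand-in, RMS normalization is assumed as the Q/K normalization, and a Python thread stands in for an asynchronous communication stream.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def linear(x, w):
    """Project each token vector in x by the weight matrix w (rows = outputs)."""
    return [[sum(wi * xi for wi, xi in zip(row, tok)) for row in w] for tok in x]


def rms_norm(x, eps=1e-6):
    """Normalize each token vector by its root-mean-square (assumed norm type)."""
    out = []
    for tok in x:
        rms = math.sqrt(sum(v * v for v in tok) / len(tok) + eps)
        out.append([v / rms for v in tok])
    return out


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs by a position-dependent angle."""
    out = []
    for pos, tok in enumerate(x):
        rotated = []
        for i in range(0, len(tok), 2):
            theta = pos / (base ** (i / len(tok)))
            c, s = math.cos(theta), math.sin(theta)
            a, b = tok[i], tok[i + 1]
            rotated += [a * c - b * s, a * s + b * c]
        out.append(rotated)
    return out


def fake_all_to_all(tensor, delay_s=0.05):
    """Hypothetical stand-in for a collective: sleeps, then returns the data unchanged."""
    time.sleep(delay_s)
    return tensor


def v_first_step(x, wq, wk, wv, pool):
    """Issue V's communication first, then compute Q/K while it is in flight."""
    # 1. V needs only the linear projection, so it is ready first.
    v = linear(x, wv)
    # 2. Launch V's communication asynchronously.
    v_future = pool.submit(fake_all_to_all, v)
    # 3. Overlap: run the longer Q/K path (projection, norm, RoPE) meanwhile.
    q = rope(rms_norm(linear(x, wq)))
    k = rope(rms_norm(linear(x, wk)))
    # 4. Q and K are communicated afterwards (blocking in this sketch).
    q = fake_all_to_all(q)
    k = fake_all_to_all(k)
    # 5. Collect V; if step 3 took longer than the transfer, this does not wait.
    return q, k, v_future.result()
```

A call might look like `with ThreadPoolExecutor(max_workers=1) as pool: q, k, v = v_first_step(x, wq, wk, wv, pool)`. In a real multi-GPU setting the thread would be replaced by an asynchronous collective on a separate communication stream, and the benefit depends on how the Q/K compute time compares with V's transfer time.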