All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
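The core primitive behind the framework is the classical Birkhoff–von Neumann decomposition: any doubly stochastic demand matrix can be written as a convex combination of permutation matrices, each of which corresponds to a conflict-free matching that can be scheduled for a fraction of the frame. A minimal flat (non-hierarchical) sketch is below; it uses Kuhn's augmenting-path bipartite matching to extract one permutation per iteration, and all function names are illustrative rather than the paper's API:

```python
def bvn_decompose(D, eps=1e-9):
    """Decompose a doubly stochastic matrix D into sum_k c_k * P_k,
    where each P_k is a permutation (a conflict-free matching)."""
    n = len(D)
    D = [row[:] for row in D]  # work on a copy
    terms = []
    while True:
        # Kuhn's augmenting-path matching on the support {(i,j): D[i][j] > eps}.
        match = [-1] * n  # match[j] = row currently matched to column j

        def augment(i, seen):
            for j in range(n):
                if D[i][j] > eps and j not in seen:
                    seen.add(j)
                    if match[j] == -1 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        if not all(augment(i, set()) for i in range(n)):
            break  # no perfect matching left: matrix is (numerically) exhausted
        perm = [0] * n  # perm[i] = column assigned to row i
        for j, i in enumerate(match):
            perm[i] = j
        # Largest coefficient that keeps the residual matrix nonnegative;
        # subtracting it zeroes at least one entry, so the loop terminates
        # after at most n^2 iterations.
        coeff = min(D[i][perm[i]] for i in range(n))
        for i in range(n):
            D[i][perm[i]] -= coeff
        terms.append((coeff, perm))
    return terms
```

In the hierarchical scheme described above, a decomposition like this is run on the small server-level aggregate matrix, and each extracted server-to-server matching is then refined into GPU-level matchings, which is what reduces complexity relative to decomposing the full GPU-level matrix directly.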