Optical circuit switches (OCSes) are increasingly deployed in commercial GPU clusters optimized for machine learning (ML) workloads (e.g., the Google A3 GPU cluster). Such clusters adopt a three-tier leaf-spine-OCS topology: servers attach to leaf-layer electronic packet switches (EPSes); the leaf switches aggregate into spine-layer EPSes to form a Pod; and multiple Pods are interconnected via core-layer OCSes. Unlike EPSes, OCSes support only circuit-based paths between directly connected spine switches. This restriction can induce routing polarization, in which the bandwidth requirement between a given pair of Pods is fulfilled unevenly by the links of different spine switches. The resulting imbalance causes traffic contention and bottlenecks on specific leaf-to-spine links, ultimately reducing ML training throughput. To mitigate this issue, we introduce a leaf-centric paradigm that ensures traffic originating from the same leaf switch is distributed evenly across multiple spine switches, keeping their loads balanced. Through theoretical analysis, we establish a sufficient condition for avoiding routing polarization and propose a corresponding logical topology design algorithm with polynomial-time complexity. Large-scale simulations show a throughput improvement of up to 19.27% and a 99.16% reduction in logical topology computation overhead compared to Mixed Integer Programming (MIP)-based methods.
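To make routing polarization concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm proposed in the paper. It assumes a single source Pod, that traffic toward a destination Pod is split across spine switches in proportion to the number of OCS circuits each spine holds toward that Pod, and that the load on a leaf-to-spine uplink equals the traffic forwarded through that spine. The function name `uplink_loads` and the matrices `polarized` and `balanced` are hypothetical.

```python
from fractions import Fraction


def uplink_loads(circuits, demand):
    """Return the per-spine uplink load seen by one leaf switch.

    circuits[s][p] -- number of OCS circuits from spine s to destination Pod p
    demand[p]      -- traffic volume the leaf sends to destination Pod p

    Assumption (not from the paper): traffic to Pod p is split across
    spines in proportion to their circuit counts toward p.
    """
    num_spines = len(circuits)
    loads = [Fraction(0)] * num_spines
    for pod, volume in enumerate(demand):
        total = sum(row[pod] for row in circuits)
        if total == 0:
            continue  # no circuit reaches this Pod; demand cannot be carried
        for s in range(num_spines):
            loads[s] += Fraction(volume * circuits[s][pod], total)
    return loads


# Two spines, two destination Pods, two circuits per spine in both cases.
polarized = [[2, 0],   # spine 0 connects only to Pod 0
             [0, 2]]   # spine 1 connects only to Pod 1
balanced = [[1, 1],    # every spine has one circuit to every Pod
            [1, 1]]

demand = [4, 0]        # the leaf sends 4 units to Pod 0 and nothing to Pod 1

print(uplink_loads(polarized, demand))  # all traffic lands on spine 0
print(uplink_loads(balanced, demand))   # traffic is shared by both spines
```

Both logical topologies use the same number of circuits per spine, yet in the polarized one a single leaf-to-spine uplink carries the whole demand while the other stays idle. The balanced layout spreads the same demand evenly, which is the property the leaf-centric paradigm aims to guarantee.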