Leaf-centric Logical Topology Design for OCS-based GPU Clusters

Xinchi Han, Weihao Jiang, Yingming Mao, Yike Liu, Zhuoran Liu, Yongxi Lv, Peirui Cao, Zhuotao Liu, Ximeng Liu, Xinbing Wang, Changbo Wu, Zihan Zhu, Wu Dongchao, Yang Jian, Zhang Zhanbang, Yuansen Chen, Shizhen Zhao

Recent years have witnessed the growing deployment of optical circuit switches (OCSes) in commercial GPU clusters (e.g., the Google A3 GPU cluster) optimized for machine learning (ML) workloads. Such clusters adopt a three-tier leaf-spine-OCS topology: servers attach to leaf-layer electronic packet switches (EPSes); these leaf switches aggregate into spine-layer EPSes to form a Pod; and multiple Pods are interconnected via core-layer OCSes. Unlike EPSes, OCSes only support circuit-based paths between directly connected spine switches. This can induce a phenomenon termed routing polarization, in which the bandwidth requirements between specific pairs of Pods are fulfilled unevenly by the links among different spine switches. The resulting imbalance causes traffic contention and bottlenecks on specific leaf-to-spine links, ultimately reducing ML training throughput. To mitigate this issue, we introduce a leaf-centric paradigm that ensures traffic originating from the same leaf switch is evenly distributed across multiple spine switches with balanced loads. Through rigorous theoretical analysis, we establish a sufficient condition for avoiding routing polarization and propose a corresponding logical topology design algorithm with polynomial-time complexity. Large-scale simulations show up to a 19.27% throughput improvement and a 99.16% reduction in logical topology computation overhead compared to Mixed Integer Programming (MIP)-based methods.
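As a rough illustration of routing polarization, the sketch below bounds how much traffic a single leaf switch can send to one remote Pod under two different placements of the same number of OCS circuits. It is a toy model and not the paper's algorithm: the function `leaf_throughput`, its parameters, and the unit capacities are assumptions made for this example only.

```python
def leaf_throughput(circuits_per_spine, uplink_capacity=1.0, circuit_capacity=1.0):
    """Upper bound on the traffic one leaf can send to one remote Pod.

    Assumed toy model: the leaf has one uplink to each spine, and traffic
    routed via a spine is limited both by that uplink and by the OCS
    circuits the spine holds toward the remote Pod.
    """
    return sum(
        min(uplink_capacity, n * circuit_capacity)
        for n in circuits_per_spine
    )


# Four circuits toward the remote Pod, all placed on one spine (polarized)
# versus one circuit on each of four spines (balanced).
polarized = [4, 0, 0, 0]
balanced = [1, 1, 1, 1]

print(leaf_throughput(polarized))  # 1.0: a single leaf-to-spine uplink is the bottleneck
print(leaf_throughput(balanced))   # 4.0: all four uplinks carry traffic
```

Both placements give the Pod pair the same aggregate circuit bandwidth, yet in this model the polarized placement forces the leaf's traffic through a single leaf-to-spine link, which is the kind of bottleneck the abstract describes.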
