Modern GPU-based high-performance computing clusters offer unprecedented communication bandwidth through heterogeneous intra-node interconnects and inter-node networks. However, despite this high aggregate bandwidth, many real-world communication patterns fail to fully utilize the available hardware. Traffic skew often leads to situations where a small subset of links becomes oversaturated while others remain underutilized, resulting in congestion, latency spikes, and poor scalability. Existing communication frameworks such as NCCL and MPI with UCX typically rely on static fastest-path routing or hashing-based multi-rail striping, which leaves significant bandwidth unused when runtime traffic deviates from expected distributions. To address these limitations, we propose NIMBLE (Node-Interconnect Multi-path Balancing with Execution-time orchestration), a runtime communication orchestration system that dynamically redistributes traffic to balance link utilization across all available intra-node and inter-node paths. NIMBLE formulates this as a capacity-normalized minimum-congestion optimization problem and solves it efficiently using a multiplicative-weights algorithm. It further employs CUDA-aware, GPU-kernel-based RDMA pipelining to route traffic through intermediate GPUs and rail-matched NICs. The system is endpoint-driven, integrates transparently with existing communication libraries without requiring application changes, and preserves ordering and determinism while keeping overhead low. On H100-SXM4 nodes with fully connected NVLink and four NDR400 rails, NIMBLE achieves up to 2.3x higher intra-node bandwidth and 3.8x higher inter-node throughput compared to single-path baselines. It outperforms NCCL and MPI by up to 5.2x on skewed All-to-Allv workloads and by up to 1.35x on end-to-end LLM MoE workloads, while matching baseline performance under balanced traffic.
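To make the idea of a capacity-normalized minimum-congestion problem solved with multiplicative weights more concrete, the following is a minimal Python sketch of one textbook variant of that technique. It is not NIMBLE's implementation; the abstract does not give the exact formulation. The function name `balance_paths`, the link labels, the parameters `rounds` and `eps`, and the toy topology (one direct link of capacity 2 and a two-hop relay path through an intermediate GPU with capacity 1 per hop) are all assumptions made for illustration. Each round, every demand is routed on the path with the lowest capacity-normalized weight, links are penalized multiplicatively in proportion to their utilization, and the per-round routings are averaged into a fractional split.

```python
import math


def balance_paths(capacity, demands, rounds=400, eps=0.1):
    """Approximate a min-congestion traffic split with multiplicative weights.

    capacity: {link_name: bandwidth}
    demands:  list of (volume, [path, ...]); each path is a tuple of link names.
    Returns (splits, congestion), where splits[i][j] is the volume of demand i
    placed on its j-th path and congestion is the maximum load/capacity ratio.
    All names and defaults here are illustrative assumptions.
    """
    weight = {link: 1.0 for link in capacity}
    # Width: an upper bound on the utilization any link can see in one round.
    total_volume = sum(volume for volume, _ in demands)
    width = (total_volume / min(capacity.values())) or 1.0
    routed = [[0.0] * len(paths) for _, paths in demands]

    for _ in range(rounds):
        load = dict.fromkeys(capacity, 0.0)
        for i, (volume, paths) in enumerate(demands):
            # Best response: send the whole demand down the path whose links
            # currently carry the smallest capacity-normalized weight.
            costs = [sum(weight[link] / capacity[link] for link in path)
                     for path in paths]
            best = costs.index(min(costs))
            routed[i][best] += volume
            for link in paths[best]:
                load[link] += volume
        # Multiplicative update: heavily utilized links become more expensive.
        for link in capacity:
            weight[link] *= math.exp(eps * load[link] / capacity[link] / width)
        # Rescale to avoid overflow; this does not change the path ranking.
        scale = sum(weight.values())
        for link in weight:
            weight[link] /= scale

    # The average of the per-round routings is the fractional split.
    splits = [[v / rounds for v in row] for row in routed]
    final_load = dict.fromkeys(capacity, 0.0)
    for (_, paths), row in zip(demands, splits):
        for path, volume in zip(paths, row):
            for link in path:
                final_load[link] += volume
    congestion = max(final_load[link] / capacity[link] for link in capacity)
    return splits, congestion


# Hypothetical topology: GPU A reaches GPU B directly or by relaying via GPU C.
capacity = {"A->B": 2.0, "A->C": 1.0, "C->B": 1.0}
demands = [(6.0, [("A->B",), ("A->C", "C->B")])]
splits, congestion = balance_paths(capacity, demands)
print(splits, congestion)
```

In this toy case, sending everything over the direct link would give a congestion of 3 (6 units over capacity 2), while the optimal split places 4 units on the direct link and 2 on the relay path for a congestion of 2. The sketch approaches that optimum as `rounds` grows; a production system would additionally need to respect message ordering and chunk granularity, which this example ignores.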