Ethereal: Divide and Conquer Network Load Balancing in Large-Scale Distributed Training

Large-scale distributed training in production datacenters is a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is \emph{necessary} to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: \emph{How close can single-path transport come to matching the performance of packet spraying?} We demonstrate that single-path transport (from a NIC's perspective) is sufficient and can perform nearly as well as ideal packet spraying, particularly in the context of distributed training in Clos-based topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: \emph{(i)} flow sizes are known upon arrival, \emph{(ii)} flow sizes are equal within each step of a collective, \emph{(iii)} the completion time of a collective is more critical than individual flow completion times, and \emph{(iv)} flows can be \emph{split} upon arrival to control load balancing directly from the application layer. We present Ethereal, a simple distributed load balancing algorithm that opportunistically splits flows and assigns a path to each flow in a transparent manner, requiring little to no changes to existing RDMA NICs. Our evaluation, spanning a wide range of collective communication algorithms and GPT models using Astra-Sim, shows that Ethereal reduces completion times by up to $30\%$ compared to packet spraying and by up to $40\%$ compared to REPS, even under link failures. This paper offers an alternative perspective for developing next-generation transport protocols tailored to large-scale distributed training.
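
To make observations \emph{(i)}, \emph{(ii)}, and \emph{(iv)} concrete, the following is a minimal sketch of how equal-sized flows with known sizes could be split and pinned to paths so that every path carries the same load. It is an illustration under stated assumptions, not the algorithm from the paper: the function names `split_and_assign` and `path_loads`, the round-robin placement, and the rule of cutting each leftover flow into one piece per path are all hypothetical choices made for this example.

```python
from fractions import Fraction


def split_and_assign(num_flows, flow_size, num_paths):
    """Return (flow_id, path_id, size) subflows with equal load on every path.

    Hypothetical sketch: assumes all flows have the same, known size and that
    all paths are equivalent (e.g., uplinks of one leaf switch).
    """
    if num_flows < 0 or num_paths <= 0:
        raise ValueError("need num_flows >= 0 and num_paths > 0")

    subflows = []
    # Flows that fit evenly are placed unsplit, round-robin over the paths.
    whole = num_flows - (num_flows % num_paths)
    for flow_id in range(whole):
        subflows.append((flow_id, flow_id % num_paths, Fraction(flow_size)))

    # Each leftover flow is split into one equal piece per path.
    piece = Fraction(flow_size, num_paths)
    for flow_id in range(whole, num_flows):
        for path_id in range(num_paths):
            subflows.append((flow_id, path_id, piece))
    return subflows


def path_loads(subflows, num_paths):
    """Sum the bytes assigned to each path."""
    loads = [Fraction(0)] * num_paths
    for _, path_id, size in subflows:
        loads[path_id] += size
    return loads


if __name__ == "__main__":
    # 5 equal flows of 8 units over 4 paths: 4 flows stay whole, 1 is split.
    plan = split_and_assign(num_flows=5, flow_size=8, num_paths=4)
    print(path_loads(plan, num_paths=4))
```

With 5 flows of 8 units over 4 paths, every path in this sketch carries 10 units, whereas any placement of whole flows would put at least 16 units on one path. Each subflow still follows a single path, so no per-packet spraying is involved.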
