The extreme bandwidth demands of AI training have made load balancing a critical component of AI fabrics, and a variety of load-balancing designs have emerged in recent work from both industry and research. However, there is currently little consensus on which design approach dominates, or under what conditions. We also lack an understanding of how far these approaches are from optimal. We provide a technical foundation for answering these questions by systematically evaluating leading load-balancing designs while decoupling them from specific congestion control and loss recovery stacks. We find that load balancing based on packet spraying dominates traditional approaches that balance traffic at flow, flowlet, or subflow granularity. Comparing host-based and switch-based approaches to packet spraying, we find that they perform similarly in failure-free scenarios, but that a host-based approach dominates under link failure because of its rapid visibility into end-to-end path conditions. We also find that no leading approach achieves optimal O(1) queue scaling at maximum utilization. We demonstrate why a destination-based rotation (DR) discipline can reach this optimum and introduce Ofan, a switch-based implementation of DR that we show offers valuable performance gains over other packet spraying approaches.
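To make the granularity distinction concrete, the following minimal Python sketch contrasts flow-level hashing with per-packet spraying over a set of equal-cost paths. It is an illustration under stated assumptions, not the evaluation setup or the DR/Ofan design described above: the function names, the CRC32 flow hash, and the plain round-robin spraying policy are all hypothetical stand-ins chosen for brevity.

```python
import zlib


def flow_level_load(flows: dict[str, int], num_paths: int) -> list[int]:
    """Per-path packet counts when every packet of a flow follows one hashed path.

    Assumption: CRC32 of the flow identifier stands in for a switch's flow hash.
    """
    load = [0] * num_paths
    for flow_id, packets in flows.items():
        path = zlib.crc32(flow_id.encode()) % num_paths
        load[path] += packets
    return load


def packet_spray_load(flows: dict[str, int], num_paths: int) -> list[int]:
    """Per-path packet counts when each packet is placed independently.

    Assumption: simple round-robin is used as the spraying policy; real designs
    may choose paths randomly or adaptively.
    """
    load = [0] * num_paths
    next_path = 0
    for packets in flows.values():
        for _ in range(packets):
            load[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return load


if __name__ == "__main__":
    # One large flow, as is common in collective communication traffic.
    flows = {"gpu0->gpu1": 1000}
    print("flow-level:  ", flow_level_load(flows, 4))
    print("packet spray:", packet_spray_load(flows, 4))
```

With a single large flow, flow-level hashing places all packets on one path while spraying spreads them across every path. The sketch ignores packet reordering, queueing, and failures, which are the factors that distinguish the spraying variants compared in this work.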