Next-generation datacenters require highly efficient network load balancing to handle the growing scale of artificial intelligence (AI) training and general datacenter traffic. However, existing Ethernet-based solutions, such as Equal-Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization as traffic demands increase and datacenter topologies expand, a growth in scale that also exacerbates network failures. To address these limitations, we propose REPS, a lightweight, decentralized, per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching well-performing paths. When a network failure occurs, REPS reroutes traffic away from the failure in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and uses less than 25 bytes of per-connection state regardless of topology size. We extensively evaluate REPS in large-scale simulations and on FPGA-based NICs.
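The abstract states only that REPS caches well-performing paths and steers traffic away from failures; it does not spell out the mechanism. The following is a minimal sketch of that general idea, not the REPS algorithm itself. It assumes that each packet carries a path identifier, that acknowledgments echo this identifier together with a congestion flag, and that a timeout signals a suspected failure. The class `PathCache`, its methods, and the parameters `capacity` and `num_paths` are all hypothetical names introduced for illustration.

```python
import random
from collections import deque


class PathCache:
    """Hypothetical per-connection cache of path identifiers that recently performed well."""

    def __init__(self, capacity=8, num_paths=256, rng=None):
        # A small, fixed capacity keeps per-connection state bounded
        # independently of the topology size (assumed values).
        self.cache = deque(maxlen=capacity)
        self.num_paths = num_paths
        self.rng = rng or random.Random()

    def next_path(self):
        """Pick the path identifier for the next packet."""
        if self.cache:
            # Reuse a path that was recently acknowledged without congestion.
            return self.cache.popleft()
        # Nothing cached: explore a random path, as oblivious spraying would.
        return self.rng.randrange(self.num_paths)

    def on_ack(self, path_id, congested):
        """Remember a path only if its acknowledgment carried no congestion signal."""
        if not congested:
            self.cache.append(path_id)

    def on_timeout(self, path_id):
        """Forget every cached copy of a path suspected of having failed."""
        self.cache = deque(
            (p for p in self.cache if p != path_id),
            maxlen=self.cache.maxlen,
        )
```

In this sketch a sender would call `next_path()` for each outgoing packet and feed every acknowledgment or timeout back through `on_ack()` or `on_timeout()`, so congested or failed paths drop out of the cache while good ones are recycled.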