Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. Existing solutions designed for Ethernet, such as Equal-Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization as datacenter topologies, and consequently the number of network failures, continue to grow. To address these limitations, we propose ARCANE, a lightweight, decentralized, per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. ARCANE adapts to network conditions by caching well-performing paths. When a network failure occurs, ARCANE reroutes traffic away from the failure in less than 100 microseconds. ARCANE is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and introduces less than 25 bytes of per-connection state. We extensively evaluate ARCANE in large-scale simulations and on FPGA-based NICs.
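The abstract does not spell out how paths are cached, so the following is only a minimal sketch of the general idea of "caching well-performing paths" in a per-packet load balancer, and should not be read as ARCANE's actual algorithm. The class name `PathCacheSelector`, the parameters `num_paths` and `cache_size`, and the boolean `congested` feedback signal are all hypothetical, introduced here purely for illustration.

```python
import random
from collections import deque


class PathCacheSelector:
    """Hypothetical per-connection path selector (illustrative only)."""

    def __init__(self, num_paths=256, cache_size=8, seed=None):
        self.num_paths = num_paths              # assumed number of selectable paths
        self.cache = deque(maxlen=cache_size)   # path IDs that recently performed well
        self.rng = random.Random(seed)

    def next_path(self):
        """Pick the path for the next packet."""
        if self.cache:
            # Reuse a path that recently returned good feedback.
            return self.cache.popleft()
        # Nothing cached: fall back to oblivious (random) spraying.
        return self.rng.randrange(self.num_paths)

    def on_feedback(self, path_id, congested):
        """Update the cache from per-packet feedback (e.g. an ACK)."""
        if congested:
            # Stop reusing a path that reported congestion or loss.
            self.cache = deque(
                (p for p in self.cache if p != path_id),
                maxlen=self.cache.maxlen,
            )
        else:
            self.cache.append(path_id)
```

In this sketch, a path that reports trouble is simply evicted, so subsequent packets fall back to other cached paths or to random spraying. How quickly real traffic moves away from a failed link depends on the feedback mechanism, which this example does not model.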