Remote Direct Memory Access (RDMA) is widely used in data center networks because of its high performance. However, because of RDMA's retransmission strategy and the traffic patterns of AI training, existing load balancing schemes for data center networks are ill-suited to RDMA. In this paper, we propose SeqBalance, a load balancing framework designed for RDMA. SeqBalance achieves fine-grained load balancing for RDMA without causing packet reordering. Its design relies only on existing commercial RNICs and commercial programmable switches, so it is compatible with existing data center networks. We have implemented SeqBalance on Mellanox CX-6 RNICs and Tofino switches. Hardware testbed experiments and large-scale simulations show that, compared with existing load balancing schemes, SeqBalance improves average flow completion time (FCT) and 99th-percentile FCT by 18.7% and 33.2%, respectively.