Communication overhead is a significant bottleneck in federated learning (FL), and it has been exacerbated by the increasing size of AI models. In this paper, we propose FedRDMA, a communication-efficient cross-silo FL system that integrates RDMA into the FL communication protocol. To overcome the limitations of RDMA in wide-area networks (WANs), FedRDMA divides the updated model into chunks and applies a series of optimization techniques to improve the efficiency and robustness of RDMA-based communication. We implement FedRDMA atop an industrial federated learning framework and evaluate it in a real-world cross-silo FL scenario. The experimental results show that FedRDMA achieves up to a 3.8$\times$ speedup in communication efficiency compared to traditional TCP/IP-based FL systems.
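To make the chunking idea concrete, the following is a minimal sketch of splitting a serialized model update into indexed chunks and reassembling it on the receiving side. It is an illustration under stated assumptions, not FedRDMA's implementation: the function names `split_into_chunks` and `reassemble` and the chunk size are hypothetical, and the actual RDMA transfer of each chunk is not shown.

```python
from typing import Iterable, Iterator, Tuple


def split_into_chunks(payload: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, data) pairs covering the payload in order.

    Hypothetical helper: the chunk size is an illustrative parameter,
    not a value taken from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(payload), chunk_size)):
        yield index, payload[offset:offset + chunk_size]


def reassemble(chunks: Iterable[Tuple[int, bytes]]) -> bytes:
    """Rebuild the payload from (index, data) pairs received in any order."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return b"".join(data for _, data in ordered)


# Stand-in for a serialized model update (1024 bytes of dummy data).
model_update = bytes(range(256)) * 4
chunks = list(split_into_chunks(model_update, chunk_size=300))

# Simulate out-of-order arrival, then reassemble by chunk index.
restored = reassemble(reversed(chunks))
```

Carrying an index with every chunk lets the receiver tolerate reordering and request retransmission of a single chunk rather than the whole model, which is one plausible reason for chunked transfer over an unreliable wide-area link.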