Federated learning (FL) systems enable multiple clients to iteratively train a machine learning model by synchronously exchanging intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time caused by synchronous communication, and the risk of the single server becoming a bottleneck. In this paper, we propose a new FL architecture that is, to our knowledge, the first entirely asynchronous multi-server FL system, and therefore addresses both limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, ensuring efficient integration of updates into the model. Unlike previous methods, however, servers also periodically update each other asynchronously and never postpone interactions with clients. We compare our solution to three representative baselines (FedAvg, FedAsync, and HierFAVG) on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy than these baselines and requires 61% less time to do so in geo-distributed settings.