Federated learning (FL) systems enable multiple clients to iteratively train a machine learning model by synchronously exchanging intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time caused by synchronous communication, and the risk of the single server becoming a bottleneck. In this paper, we propose a new FL architecture that is, to our knowledge, the first entirely asynchronous multi-server FL system, and therefore addresses both limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, ensuring efficient integration of updates into the model. Unlike previous methods, however, servers also periodically update each other asynchronously and never postpone interactions with clients. We compare our solution to three representative baselines (FedAvg, FedAsync, and HierFAVG) on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy than these baselines and requires 61% less time to do so in geo-distributed settings.