Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro$^+$, a novel scalable FL system that enables a massive number of FL applications to run simultaneously on edge networks. The key insight is to use a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies, where many FL applications shared one centralized parameter server, Totoro$^+$ assigns a dedicated parameter server to each application. Any edge node can act as any application's coordinator, aggregator, client selector, worker (participant device), or any combination of these roles, which radically improves scalability and adaptivity. Totoro$^+$ introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a game-theoretic path planning model that guarantees an $\varepsilon$-approximate Nash equilibrium. Real-world experiments on 500 Amazon EC2 servers show that Totoro$^+$ scales gracefully with the number of FL applications and the number of edge nodes $N$, speeds up total training time by $1.2\times$ to $14.0\times$, achieves $\mathcal{O}(\log N)$ hops for model dissemination and gradient aggregation with millions of nodes, and adapts efficiently to practical edge networks and churn.