Federated Learning (FL) has attracted growing interest in recent years as a distributed on-device learning paradigm. However, several challenges must still be addressed before FL can be deployed in real-world, hierarchical Internet-of-Things (IoT) networks. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers, and scalability, none of them provides a systematic solution that addresses all of these challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to highly variable delays, Async-HFL employs asynchronous aggregation at both the gateway and cloud levels, thus avoiding long waiting times. To fully realize the potential of Async-HFL in convergence speed under system heterogeneity and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real time, while device-gateway association determines the network topology periodically after several cloud epochs; both satisfy bandwidth limitations. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% of total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.
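To make the idea of asynchronous aggregation at two levels concrete, the following is a minimal sketch, not the Async-HFL implementation. The abstract does not state the aggregation rule, so the staleness-discounted mixing used here (a common choice in asynchronous FL), the names `AsyncAggregator` and `staleness_weight`, and the parameter `base_alpha` are all assumptions made for illustration. The point it shows is that an aggregator folds in each update as soon as it arrives, rather than waiting for every participant.

```python
from dataclasses import dataclass
from typing import List


def staleness_weight(base_alpha: float, staleness: int) -> float:
    """Assumed polynomial discount: older updates get a smaller mixing weight."""
    return base_alpha / (1.0 + staleness) ** 0.5


@dataclass
class AsyncAggregator:
    """Hypothetical aggregator usable at either the gateway or the cloud level.

    It keeps one model (a flat list of floats here) and a version counter,
    and mixes in each incoming update immediately.
    """
    weights: List[float]
    base_alpha: float = 0.5
    version: int = 0

    def apply_update(self, update: List[float], trained_on_version: int) -> None:
        # Staleness = how many aggregations happened since the sender
        # last fetched the model it trained on.
        staleness = self.version - trained_on_version
        alpha = staleness_weight(self.base_alpha, staleness)
        self.weights = [
            (1.0 - alpha) * w + alpha * u
            for w, u in zip(self.weights, update)
        ]
        self.version += 1


# Two tiers above the devices: one gateway and the cloud.
gateway = AsyncAggregator(weights=[0.0, 0.0])
cloud = AsyncAggregator(weights=[0.0, 0.0])

# A fresh device update arrives at the gateway (trained on gateway version 0).
gateway.apply_update([1.0, 2.0], trained_on_version=0)

# A straggler's update, also trained on version 0, arrives later and is
# therefore stale by one step; it is still used, but with a smaller weight.
gateway.apply_update([1.0, 2.0], trained_on_version=0)

# The gateway forwards its current model to the cloud without waiting
# for any other gateway.
cloud.apply_update(gateway.weights, trained_on_version=0)
```

Because no aggregator blocks on a slow participant, a straggler delays only its own contribution. Device selection and device-gateway association, which the abstract describes as the other two components, are deliberately left out of this sketch.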