Federated Learning (FL) is a distributed learning paradigm that enables edge devices to collaboratively train a global model using their local data. Simulating FL on GPUs is essential for fast prototyping and evaluation of FL algorithms. However, current FL frameworks overlook the gap between algorithm simulation and real-world deployment, which arises from heterogeneous computing capabilities and imbalanced workloads, and therefore yield misleading evaluations of new algorithms. They also lack the flexibility and scalability to accommodate resource-constrained clients. In this paper, we present FedHC, a scalable federated learning framework for heterogeneous and resource-constrained clients. FedHC realizes system heterogeneity by allocating a dedicated, constrained GPU resource budget to each client, and simulates workload heterogeneity in terms of framework-provided runtime. Furthermore, we improve GPU resource utilization when scaling to many clients by introducing a dynamic client scheduler, a process manager, and a resource-sharing mechanism. Our experiments demonstrate that FedHC captures the influence of various factors on client execution time. Moreover, despite the resource constraints imposed on each client, FedHC achieves state-of-the-art efficiency compared to existing frameworks that run without such limits. When existing frameworks are subjected to the same resource constraints, FedHC achieves a 2.75x speedup. The code is available at https://github.com/if-lab-repository/FedHC.
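To make the idea of per-client GPU budgets and dynamic client scheduling concrete, the following is a minimal, illustrative sketch in plain Python. It is not FedHC's implementation or API: the names `Client`, `run_time`, and `simulate_round` are hypothetical, and the cost model (execution time inversely proportional to the budget) is an assumption chosen only for illustration. Each client reserves a fraction of one GPU, and a greedy scheduler starts any waiting client as soon as enough capacity is free.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Client:
    cid: int
    budget: float    # assumed: fraction of one GPU reserved for this client, 0 < budget <= 1
    workload: float  # assumed: abstract work units for one local training round


def run_time(client: Client) -> float:
    # Assumed cost model: time scales inversely with the compute budget.
    return client.workload / client.budget


def simulate_round(clients, capacity=1.0):
    """Greedy dynamic scheduling on one simulated GPU.

    Whenever capacity is free, every waiting client that fits is started.
    Returns (finish time per client id, makespan of the round).
    """
    waiting = list(clients)
    running = []  # min-heap of (finish_time, cid, budget)
    now, free = 0.0, capacity
    finish = {}
    while waiting or running:
        still_waiting = []
        for c in waiting:
            if c.budget <= free + 1e-9:
                free -= c.budget
                heapq.heappush(running, (now + run_time(c), c.cid, c.budget))
            else:
                still_waiting.append(c)
        waiting = still_waiting
        if not running:
            raise ValueError("a client budget exceeds the GPU capacity")
        # Advance simulated time to the next completion and release its budget.
        now, cid, budget = heapq.heappop(running)
        free += budget
        finish[cid] = now
    return finish, now


if __name__ == "__main__":
    demo = [
        Client(0, 0.5, 10.0),
        Client(1, 0.25, 10.0),
        Client(2, 0.5, 5.0),
        Client(3, 0.25, 2.0),
    ]
    finish, makespan = simulate_round(demo)
    for cid in sorted(finish):
        print(f"client {cid} finished at t={finish[cid]}")
    print(f"makespan: {makespan}")
```

In this toy setting, a client with a small budget and a large workload (client 1) dominates the round time even though the GPU is shared, which is the kind of straggler effect that a simulation without per-client resource limits would not show.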