FedML Parrot: A Scalable Federated Learning System via Heterogeneity-aware Scheduling on Sequential and Hierarchical Training

Federated Learning (FL) enables clients to collaboratively train machine learning models while protecting their data privacy. Existing FL simulation platforms, which are designed from the perspective of traditional distributed training, suffer from laborious code migration between simulation and production, low efficiency, low GPU utilization, poor scalability with high hardware requirements, and difficulty in simulating stateful clients. In this work, we first demystify the challenges and bottlenecks of simulating FL, and then design a new FL system named FedML \texttt{Parrot}. It improves training efficiency, substantially relaxes hardware requirements, and supports efficient large-scale FL experiments with stateful clients through: (1) sequential training of clients on each device; (2) decomposition of the original aggregation into local aggregation on devices and global aggregation on the server; (3) task scheduling that mitigates the straggler problem and improves compute utilization; and (4) a distributed client state manager that supports a variety of FL algorithms. In addition, because \texttt{Parrot} is built on our generic APIs and communication interfaces, users can move seamlessly from simulation to real-world deployment without modifying their code. We evaluate \texttt{Parrot} through extensive experiments that train diverse models on various FL datasets. The results show that \texttt{Parrot} can simulate over 1000 clients (stateful or stateless) on a flexible number of GPU devices ($4 \sim 32$) with high GPU utilization, runs 1.2 $\sim$ 4 times faster than FedScale, and uses 10 $\sim$ 100 times less memory than FedML. We also verify that \texttt{Parrot} works well with both homogeneous and heterogeneous devices in three different clusters. Two FL algorithms with stateful clients and four algorithms with stateless clients are simulated to verify that \texttt{Parrot} adapts to a wide range of algorithms.
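
To make points (1) to (3) concrete, the following is a minimal sketch in plain Python, not \texttt{Parrot}'s actual implementation or API. Every name in it (`toy_local_train`, `schedule`, `device_round`, `global_aggregate`, `flat_fedavg`) is hypothetical, the model is a plain list of floats, and the greedy "earliest estimated finish time" rule is only an assumed stand-in for a heterogeneity-aware scheduler. It illustrates that training clients one at a time on each device and summing sample-weighted partial results locally gives the same global model as a flat weighted average over all clients.

```python
from typing import Dict, List, Tuple

Model = List[float]


def toy_local_train(global_model: Model, client_id: int) -> Model:
    # Placeholder for real local training (assumption): shift every weight
    # by a client-specific amount so that clients produce different models.
    return [w + 0.1 * (client_id + 1) for w in global_model]


def schedule(client_sizes: Dict[int, int], device_speeds: List[float]) -> List[List[int]]:
    # Greedy stand-in for heterogeneity-aware scheduling (assumption):
    # take the largest clients first and give each one to the device that
    # would finish it earliest, estimating time as samples / device speed.
    finish = [0.0] * len(device_speeds)
    plan: List[List[int]] = [[] for _ in device_speeds]
    for cid, n in sorted(client_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        d = min(range(len(device_speeds)), key=lambda i: finish[i] + n / device_speeds[i])
        finish[d] += n / device_speeds[d]
        plan[d].append(cid)
    return plan


def device_round(global_model: Model, client_ids: List[int],
                 client_sizes: Dict[int, int]) -> Tuple[Model, int]:
    # Sequential training plus local aggregation: only one client model is
    # alive at a time, and the device returns a single weighted sum.
    partial = [0.0] * len(global_model)
    total = 0
    for cid in client_ids:
        local = toy_local_train(global_model, cid)
        n = client_sizes[cid]
        partial = [p + n * w for p, w in zip(partial, local)]
        total += n
    return partial, total


def global_aggregate(partials: List[Tuple[Model, int]]) -> Model:
    # Global aggregation on the server: combine one partial sum per device.
    total = sum(n for _, n in partials)
    dim = len(partials[0][0])
    return [sum(p[i] for p, _ in partials) / total for i in range(dim)]


def flat_fedavg(global_model: Model, client_sizes: Dict[int, int]) -> Model:
    # Reference: sample-weighted average computed over all clients at once.
    total = sum(client_sizes.values())
    locals_ = {c: toy_local_train(global_model, c) for c in client_sizes}
    return [sum(client_sizes[c] * locals_[c][i] for c in client_sizes) / total
            for i in range(len(global_model))]


client_sizes = {0: 120, 1: 30, 2: 300, 3: 60, 4: 90, 5: 10}  # samples per client
device_speeds = [1.0, 2.5]                                   # relative throughput
global_model = [0.0, 1.0]

plan = schedule(client_sizes, device_speeds)
partials = [device_round(global_model, ids, client_sizes) for ids in plan]
new_model = global_aggregate(partials)
print(plan, new_model)
```

In this sketch the server receives one partial sum per device instead of one model per client, so communication and server-side memory grow with the number of devices, not the number of simulated clients.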
