A Federated Learning (FL) system typically consists of two core processing entities: the federation controller and the learners. The controller manages the execution of FL workflows across learners, while the learners train and evaluate federated models over their private datasets. While executing an FL workflow, the FL system has no control over the computational resources or data of the participating learners. It remains responsible, however, for other operations such as model aggregation, task dispatching, and scheduling. These computationally heavy operations generally need to be handled by the federation controller. Although many FL systems have recently been proposed to facilitate the development of FL workflows, most of them overlook the scalability of the controller. To address this gap, we designed and developed a novel FL system called MetisFL, in which the federation controller is a first-class citizen. MetisFL re-engineers all the operations conducted by the federation controller to accelerate the training of large-scale FL workflows. By quantitatively comparing MetisFL against other state-of-the-art FL systems, we empirically demonstrate that MetisFL achieves a 10-fold speedup in wall-clock execution time across a wide range of challenging FL workflows with increasing model sizes and numbers of federation sites.
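To make the controller-side cost concrete, the following is a minimal sketch of model aggregation, assuming a FedAvg-style rule in which each learner's weights are averaged in proportion to its number of training examples. The abstract does not state which aggregation rule MetisFL uses, so the rule, the `Model` format, and the `aggregate` function are all hypothetical illustrations and not part of the MetisFL API.

```python
from typing import Dict, List

# Hypothetical model format for this sketch: tensor name -> flat list of weights.
Model = Dict[str, List[float]]


def aggregate(models: List[Model], num_examples: List[int]) -> Model:
    """Example-weighted average of learner models (assumed FedAvg-style rule)."""
    if not models or len(models) != len(num_examples):
        raise ValueError("need one example count per learner model")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")

    merged: Model = {}
    for name, first in models[0].items():
        acc = [0.0] * len(first)
        # Every learner contributes to every parameter, so the work grows
        # with (number of learners) x (number of model parameters).
        for model, n in zip(models, num_examples):
            scale = n / total
            for i, weight in enumerate(model[name]):
                acc[i] += scale * weight
        merged[name] = acc
    return merged


if __name__ == "__main__":
    learner_a = {"layer0": [1.0, 3.0]}
    learner_b = {"layer0": [3.0, 5.0]}
    print(aggregate([learner_a, learner_b], [1, 3]))
```

In this sketch the nested loops touch every parameter of every learner model once per round, which illustrates why aggregation becomes a controller bottleneck as model sizes and the number of federation sites grow.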