In Federated Learning (FL), client devices connected over the internet collaboratively train a machine learning model without sharing their private data with a central server or with other clients. The seminal Federated Averaging (FedAvg) algorithm trains a single global model by performing rounds of local training on clients followed by model averaging. FedAvg can improve the communication efficiency of training by performing more steps of Stochastic Gradient Descent (SGD) on clients in each round. However, client data in real-world FL is highly heterogeneous, which has been extensively shown to slow model convergence and harm final performance when $K > 1$ steps of SGD are performed on clients per round. In this work we propose decaying $K$ as training progresses, which can improve the final performance of the FL model whilst also reducing the wall-clock time and the total computational cost of training compared to using a fixed $K$. We analyse the convergence of FedAvg with decaying $K$ for strongly convex objectives, providing novel insights into the convergence properties, and derive three theoretically motivated decay schedules for $K$. We then perform thorough experiments on four benchmark FL datasets (FEMNIST, CIFAR100, Sentiment140, Shakespeare) to show the practical benefit of our approaches in terms of convergence time, computational cost, and generalisation performance.
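To make the idea of decaying $K$ concrete, the following is a minimal sketch of FedAvg on a toy strongly convex problem. Everything in it is an assumption made for illustration: the two quadratic client objectives, the learning rate, the Gaussian gradient noise standing in for SGD, and in particular the schedule $K_r = \lceil K_0 / (r+1) \rceil$, which is a hypothetical placeholder and not one of the three schedules derived in this work.

```python
import random

# Hypothetical toy problem: client i holds f_i(w) = 0.5 * a_i * (w - c_i)^2.
# Differing curvatures a_i and centres c_i play the role of heterogeneous data.
CLIENTS = [(1.0, 0.0), (4.0, 1.0)]  # (a_i, c_i)
# Minimiser of the average objective: curvature-weighted mean of the centres.
W_STAR = sum(a * c for a, c in CLIENTS) / sum(a for a, _ in CLIENTS)


def local_sgd(w, a, c, k, lr, rng, noise_std):
    """Run k noisy gradient steps on one client's objective, starting from w."""
    for _ in range(k):
        grad = a * (w - c) + rng.gauss(0.0, noise_std)
        w -= lr * grad
    return w


def decayed_k(k0, r):
    """Illustrative schedule (an assumption): ceil(k0 / (r + 1)), never below 1."""
    return max(1, -(-k0 // (r + 1)))


def fedavg(rounds, k0, decay, lr=0.1, noise_std=0.01, seed=0):
    """Return (final global model, total local steps performed per client)."""
    rng = random.Random(seed)
    w, steps = 0.0, 0
    for r in range(rounds):
        k = decayed_k(k0, r) if decay else k0
        local_models = [local_sgd(w, a, c, k, lr, rng, noise_std) for a, c in CLIENTS]
        w = sum(local_models) / len(local_models)  # server-side model averaging
        steps += k
    return w, steps


w_fixed, steps_fixed = fedavg(rounds=50, k0=10, decay=False)
w_decay, steps_decay = fedavg(rounds=50, k0=10, decay=True)
print(f"fixed K:    error={abs(w_fixed - W_STAR):.4f}, local steps={steps_fixed}")
print(f"decaying K: error={abs(w_decay - W_STAR):.4f}, local steps={steps_decay}")
```

On this toy problem the fixed-$K$ run settles at a point biased away from the global minimiser, because each client drifts towards its own optimum during its $K$ local steps. The decaying run finishes with $K = 1$, where that bias vanishes, and it performs far fewer local steps in total. The sketch only illustrates the mechanism and says nothing about the behaviour on the benchmark datasets.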