Federated learning is a popular distributed learning approach for training a machine learning model without disclosing raw data. A federated learning system consists of a parameter server and a possibly large collection of clients (e.g., in cross-device federated learning) that may operate in congested and changing environments. In this paper, we study federated learning in the presence of stochastic and dynamic communication failures, wherein the uplink between the parameter server and client $i$ is active with unknown probability $p_i^t$ in round $t$. Furthermore, we allow the dynamics of $p_i^t$ to be arbitrary. We first demonstrate that when the $p_i^t$'s vary across clients, the most widely adopted federated learning algorithm, Federated Averaging (FedAvg), experiences significant bias. To address this bias, we propose Federated Postponed Broadcast (FedPBC), a simple variant of FedAvg. FedPBC differs from FedAvg in that the parameter server postpones broadcasting the global model until the end of each round. Despite uplink failures, we show that FedPBC converges to a stationary point of the original non-convex objective. On the technical front, postponing the global model broadcasts enables implicit gossiping among the clients with active links in round $t$. Despite the time-varying nature of $p_i^t$, we can bound the perturbation of the global model dynamics using techniques for controlling gossip-type information mixing errors. Extensive experiments on real-world datasets over diverse unreliable uplink patterns corroborate our analysis.
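As a toy illustration of the bias described above, the following sketch compares FedAvg with a FedPBC-style update on scalar quadratic objectives. It is a minimal sketch under stated assumptions, not the paper's implementation. The quadratic losses, the fixed (not time-varying) uplink probabilities `p`, the single local gradient step per round, and the function name `simulate` are all hypothetical choices made for illustration. The FedPBC-style branch reflects one plausible reading of the abstract: every client updates its own local model, and at the end of the round the server averages the models that arrive and sends that average back only to the clients whose links were active.

```python
import numpy as np


def simulate(rounds=4000, lr=0.02, seed=0):
    """Compare FedAvg and a FedPBC-style update on scalar quadratics.

    Client i holds f_i(x) = 0.5 * (x - c[i])**2, so the minimizer of the
    average objective is c.mean(). All constants here are illustrative.
    """
    rng = np.random.default_rng(seed)
    c = np.arange(10, dtype=float)       # local optima 0..9, global optimum 4.5
    p = np.where(c < 5, 0.1, 0.9)        # hypothetical heterogeneous uplink probabilities

    x_avg = 0.0                          # FedAvg global model
    x_loc = np.zeros_like(c)             # FedPBC-style local models, one per client
    x_srv = 0.0                          # FedPBC-style server aggregate
    hist_avg, hist_pbc = [], []

    for _ in range(rounds):
        active = rng.random(c.size) < p  # which uplinks are on in this round

        # FedAvg: every client starts from the broadcast global model, takes
        # one gradient step, and the server averages the updates that arrive.
        updates = x_avg - lr * (x_avg - c)
        if active.any():
            x_avg = updates[active].mean()

        # FedPBC-style (assumed reading): every client steps from its own
        # local model; the broadcast happens at the end of the round and
        # reaches only the active clients, which acts like a gossip step.
        x_loc = x_loc - lr * (x_loc - c)
        if active.any():
            x_srv = x_loc[active].mean()
            x_loc[active] = x_srv

        hist_avg.append(x_avg)
        hist_pbc.append(x_srv)

    half = rounds // 2                   # discard the transient first half
    return c.mean(), np.mean(hist_avg[half:]), np.mean(hist_pbc[half:])


if __name__ == "__main__":
    opt, fedavg, fedpbc = simulate()
    print(f"optimum={opt:.2f}  FedAvg={fedavg:.2f}  FedPBC-style={fedpbc:.2f}")
```

In this toy setting, the FedAvg iterate drifts toward the optima of the clients with reliable uplinks, because their updates dominate each round's average. In the FedPBC-style variant, averaging among active clients leaves the sum of the local models unchanged, so the mean of the local models follows plain gradient descent on the average objective. The server aggregate still deviates slightly from the optimum, and that deviation shrinks as the step size `lr` shrinks.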