In federated learning, a large number of users collaborate on a global learning task. They alternate local computations and communication with a distant server. Communication, which can be slow and costly, is the main bottleneck in this setting. To accelerate distributed gradient descent, the popular strategy of local training is to communicate less frequently; that is, to perform several iterations of local computations between the communication steps. A recent breakthrough in this field was made by Mishchenko et al. (2022): their Scaffnew algorithm is the first to provably benefit from local training, with accelerated communication complexity. However, whether the powerful mechanism behind Scaffnew is compatible with partial participation, the desirable feature that not all clients need to participate in every round of the training process, remained an open and challenging question. We answer this question positively and propose a new algorithm, which handles local training and partial participation, with state-of-the-art communication complexity.