In distributed training of machine learning models, gradient descent with local iterative steps, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated Averaging (FedAvg), is a popular method for reducing the communication burden. In this method, each distributed compute node independently takes gradient steps on its local dataset to update its local model, and the local models are aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, because many solutions attain zero training loss, it is not known which of them Local-GD converges to. In this work we answer this question by analyzing the implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with an arbitrary number of local steps, converges exactly, "in direction", to the model that would be obtained if all data were in one place (the centralized model). Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. With a modified version of the Local-GD algorithm, we also obtain the same implicit bias using a learning rate that is independent of the number of local steps. Our analysis offers a new perspective on why Local-GD can still perform well with a very large number of local steps, even on heterogeneous data. Lastly, we discuss the extension of our results to Local-SGD and non-separable data.
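To make the procedure described in the abstract concrete, the following is a minimal sketch of Local-GD with logistic loss on linearly separable data, compared "in direction" (via cosine similarity) with plain gradient descent on the pooled data. The toy dataset, the two-node split, the step size, the number of rounds, and all function names are illustrative assumptions; they are not taken from the paper and do not reproduce its experiments or its modified algorithm.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * <x, w>))."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); chain rule gives the factor y * x.
    coeff = -y / (1.0 + np.exp(margins))
    return (coeff[:, None] * X).mean(axis=0)

def gd(w, X, y, lr, steps):
    """Plain full-batch gradient descent starting from w."""
    for _ in range(steps):
        w = w - lr * logistic_grad(w, X, y)
    return w

def local_gd(datasets, lr, local_steps, rounds, dim):
    """Each node runs `local_steps` GD steps from the global model, then the
    server averages the local models (uniform weights: equal-size nodes assumed)."""
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_models = [gd(w_global.copy(), X, y, lr, local_steps)
                        for X, y in datasets]
        w_global = np.mean(local_models, axis=0)
    return w_global

def cosine(a, b):
    """Cosine similarity, used here as a proxy for agreement 'in direction'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical heterogeneous toy data: node 0 and node 1 see differently
# oriented points, but the pooled dataset is linearly separable.
X0 = np.array([[2.0, 1.0], [3.0, 1.5], [-2.0, -1.0], [-3.0, -0.5]])
X1 = np.array([[1.0, 3.0], [0.5, 2.0], [-1.0, -2.0], [-1.5, -3.0]])
y0 = np.array([1.0, 1.0, -1.0, -1.0])
y1 = np.array([1.0, 1.0, -1.0, -1.0])
datasets = [(X0, y0), (X1, y1)]
X_all = np.vstack([X0, X1])
y_all = np.concatenate([y0, y1])

# Same total number of gradient steps for both runs (200 rounds x 10 local steps).
w_local = local_gd(datasets, lr=0.1, local_steps=10, rounds=200, dim=2)
w_central = gd(np.zeros(2), X_all, y_all, lr=0.1, steps=2000)

print("Local-GD direction:   ", w_local / np.linalg.norm(w_local))
print("Centralized direction:", w_central / np.linalg.norm(w_central))
print("Cosine similarity:    ", cosine(w_local, w_central))
```

Varying `local_steps` while holding `rounds * local_steps` fixed is one simple way to probe how the number of local steps affects the directional gap in this toy setting; the sketch makes no claim about the rate established in the paper.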