In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, whose variants are commonly known as Local-SGD or Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently at distributed compute nodes to update the local models, which are then aggregated intermittently. Although existing convergence analysis suggests that with heterogeneous data, the performance of FedAvg degrades quickly as the number of local steps increases, the method works quite well in practice, especially in the distributed training of large language models. In this work, we try to explain this good performance from the viewpoint of the implicit bias of Local Gradient Descent (Local-GD) with a large number of local steps. In the overparameterized regime, gradient descent at each compute node drives the model in a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as the centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation scheme and theoretically show that it converges to the centralized model in direction for linear classification. We empirically verify our theoretical findings on linear models and also conduct experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.
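The Local-GD loop described above can be sketched in a few lines of NumPy for the overparameterized linear-regression setting: each node runs many independent gradient steps on its own data, the local models are averaged intermittently, and the aggregated model is compared to a centralized model trained on the pooled data from the same initialization. The data dimensions, node count, learning rate, and step counts below are illustrative choices for a minimal sketch, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized linear regression (d > total samples),
# with the data split heterogeneously across K compute nodes.
d, n_per_node, K = 20, 5, 2
X = [rng.normal(size=(n_per_node, d)) for _ in range(K)]
w_star = rng.normal(size=d)
y = [Xk @ w_star for Xk in X]          # noiseless labels, so the data is interpolable

def gd(w, Xk, yk, steps, lr):
    """Plain gradient descent on the squared loss over one dataset."""
    for _ in range(steps):
        w = w - lr * Xk.T @ (Xk @ w - yk) / len(yk)
    return w

# Local-GD: a large number of independent local steps per node,
# followed by intermittent aggregation (simple averaging, as in FedAvg).
w = np.zeros(d)
for _ in range(200):                   # communication rounds
    w = np.mean([gd(w, X[k], y[k], steps=100, lr=0.05) for k in range(K)], axis=0)

# Centralized baseline: gradient descent on all data in one place, same init.
Xc, yc = np.vstack(X), np.concatenate(y)
w_cent = gd(np.zeros(d), Xc, yc, steps=20000, lr=0.05)

gap = np.linalg.norm(w - w_cent) / np.linalg.norm(w_cent)
print(f"relative gap to centralized model: {gap:.2e}")
```

Because both runs start from zero, every iterate stays in the row space of the pooled data, which is the setting in which gradient descent's implicit bias selects the minimum-norm interpolator; the small printed gap illustrates the regression claim that the aggregated model tracks the centralized one.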