In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, whose variants are commonly known as Local-SGD or Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently at distributed compute nodes to update the local models, which are then aggregated intermittently. Although existing convergence analysis suggests that with heterogeneous data, the performance of FedAvg degrades quickly as the number of local steps increases, the method works quite well in practice, especially in the distributed training of large language models. In this work, we try to explain this good performance from the viewpoint of the implicit bias of Local Gradient Descent (Local-GD) with a large number of local steps. In the overparameterized regime, gradient descent at each compute node drives the model in a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as the centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation scheme and theoretically show that it converges to the centralized model in direction for linear classification. We empirically verify our theoretical findings on linear models and also conduct experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.
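The Local-GD loop described above can be sketched in a few lines of NumPy for the overparameterized linear-regression setting: each node runs many independent gradient steps on its own data, the local models are averaged intermittently, and the aggregated model is compared to a centralized model trained on the pooled data from the same initialization. The data dimensions, node count, learning rate, and step counts below are illustrative choices for a minimal sketch, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized linear regression (d > total samples),
# with the data split heterogeneously across K compute nodes.
d, n_per_node, K = 20, 5, 2
X = [rng.normal(size=(n_per_node, d)) for _ in range(K)]
w_star = rng.normal(size=d)
y = [Xk @ w_star for Xk in X]          # noiseless labels, so the data is interpolable

def gd(w, Xk, yk, steps, lr):
    """Plain gradient descent on the squared loss over one dataset."""
    for _ in range(steps):
        w = w - lr * Xk.T @ (Xk @ w - yk) / len(yk)
    return w

# Local-GD: a large number of independent local steps per node,
# followed by intermittent aggregation (simple averaging, as in FedAvg).
w = np.zeros(d)
for _ in range(200):                   # communication rounds
    w = np.mean([gd(w, X[k], y[k], steps=100, lr=0.05) for k in range(K)], axis=0)

# Centralized baseline: gradient descent on all data in one place, same init.
Xc, yc = np.vstack(X), np.concatenate(y)
w_cent = gd(np.zeros(d), Xc, yc, steps=20000, lr=0.05)

gap = np.linalg.norm(w - w_cent) / np.linalg.norm(w_cent)
print(f"relative gap to centralized model: {gap:.2e}")
```

Because both runs start from zero, every iterate stays in the row space of the pooled data, which is the setting in which gradient descent's implicit bias selects the minimum-norm interpolator; the small printed gap illustrates the regression claim that the aggregated model tracks the centralized one.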