Asynchronous Local-SGD Training for Language Modeling

Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation of how worker hardware heterogeneity, model size, number of workers, and optimizer affect learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall-clock time.
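The abstract names two ingredients without giving their update rules: a delayed Nesterov momentum update on the global parameters, and local step counts adjusted to each worker's speed. The following is a minimal illustrative sketch of one plausible reading, not the authors' exact algorithm. The names `assign_local_steps`, `DelayedNesterov`, and the `delay` parameter are hypothetical. The sketch assumes that each worker sends a pseudo-gradient (global parameters minus its locally trained parameters), that momentum is refreshed only once every `delay` worker updates from a buffered average, and that local steps scale linearly with relative worker speed.

```python
from dataclasses import dataclass, field
from typing import List


def assign_local_steps(speeds: List[float], base_steps: int) -> List[int]:
    """Assumed rule: scale local steps by speed relative to the fastest worker,
    so that all workers finish a round in roughly the same wall-clock time."""
    fastest = max(speeds)
    return [max(1, round(base_steps * s / fastest)) for s in speeds]


@dataclass
class DelayedNesterov:
    """Hypothetical server-side optimizer: momentum is refreshed only every
    `delay` worker updates; in between, a plain scaled gradient step is used."""
    lr: float
    beta: float
    delay: int  # assumed: number of worker updates buffered per momentum refresh
    momentum: List[float] = field(default_factory=list)
    buffer: List[float] = field(default_factory=list)
    seen: int = 0

    def step(self, params: List[float], pseudo_grad: List[float]) -> List[float]:
        if not self.momentum:
            # Lazily size the state to match the parameter vector.
            self.momentum = [0.0] * len(params)
            self.buffer = [0.0] * len(params)
        self.seen += 1
        # Accumulate the (possibly stale) pseudo-gradient into the buffer.
        self.buffer = [b + g for b, g in zip(self.buffer, pseudo_grad)]
        scaled = [g / self.delay for g in pseudo_grad]
        if self.seen % self.delay == 0:
            # Refresh momentum from the buffered average, then take a
            # Nesterov-style step that looks ahead along the new momentum.
            self.momentum = [
                self.beta * m + b / self.delay
                for m, b in zip(self.momentum, self.buffer)
            ]
            self.buffer = [0.0] * len(params)
            update = [self.beta * m + g for m, g in zip(self.momentum, scaled)]
        else:
            # Between refreshes, momentum is left untouched.
            update = scaled
        return [p - self.lr * u for p, u in zip(params, update)]
```

In this sketch, a worker running at half the speed of the fastest one receives half as many local steps, and a stale pseudo-gradient can influence the momentum term at most once per `delay` arrivals rather than on every arrival.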

