Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization in which each device performs more than one SGD update per communication round. This work presents an empirical study of {\it asynchronous} Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its local SGD steps. We examine how worker hardware heterogeneity, model size, number of workers, and choice of optimizer affect learning performance. We find that, with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart, despite updating the global model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that uses a delayed Nesterov momentum update and adjusts each worker's number of local training steps according to its computation speed. Evaluated with models of up to 150M parameters on the C4 dataset, this approach matches synchronous Local-SGD in perplexity per update step and significantly surpasses it in wall-clock time.
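
The two ingredients named above can be illustrated with a toy sketch. This is only one plausible reading of the abstract, not the authors' implementation: the names `local_steps_for` and `DelayedNesterovServer`, the buffering rule, the `1/delay` scaling of each arriving update, and all hyperparameter values are assumptions made for illustration. A "pseudo-gradient" here means the global parameters a worker started from minus the parameters it ends with after its local steps.

```python
# Toy sketch (assumptions throughout): speed-proportional local steps and a
# server that refreshes its Nesterov momentum only every `delay` arrivals.

def local_steps_for(speeds, base_steps):
    """Give each worker a local step count proportional to its speed.

    The fastest worker runs `base_steps`; slower workers run fewer, so all
    workers take roughly the same wall-clock time per communication.
    """
    fastest = max(speeds)
    return [max(1, round(base_steps * s / fastest)) for s in speeds]


class DelayedNesterovServer:
    """Hypothetical server holding the global parameters as a list of floats."""

    def __init__(self, params, lr=0.1, beta=0.9, delay=4):
        self.params = list(params)
        self.lr, self.beta, self.delay = lr, beta, delay
        self.momentum = [0.0] * len(self.params)
        self.buffer = [0.0] * len(self.params)  # pseudo-gradients since last refresh
        self.arrivals = 0

    def apply(self, pseudo_grad):
        """Apply one worker's pseudo-gradient as soon as it arrives."""
        self.arrivals += 1
        self.buffer = [b + g for b, g in zip(self.buffer, pseudo_grad)]
        scaled = [g / self.delay for g in pseudo_grad]
        if self.arrivals % self.delay == 0:
            # Refresh momentum from the averaged buffer, then take a
            # Nesterov-style step that includes the momentum term.
            avg = [b / self.delay for b in self.buffer]
            self.momentum = [self.beta * m + a for m, a in zip(self.momentum, avg)]
            step = [self.beta * m + s for m, s in zip(self.momentum, scaled)]
            self.buffer = [0.0] * len(self.params)
        else:
            # Between refreshes: plain gradient step, momentum left untouched.
            step = scaled
        self.params = [p - self.lr * s for p, s in zip(self.params, step)]


if __name__ == "__main__":
    print(local_steps_for([1.0, 0.5, 0.25], base_steps=8))
    server = DelayedNesterovServer([1.0], lr=0.1, beta=0.9, delay=2)
    for _ in range(4):
        server.apply([1.0])
    print(server.params, server.momentum)
```

Under these assumptions, momentum is applied once per `delay` arrivals rather than on every (possibly stale) worker update, which is the kind of damping the abstract attributes to the delayed Nesterov update.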