This paper investigates scaling laws for local SGD in LLM training, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. These results demonstrate the viability of local SGD as an alternative to single large-cluster training.