Large language models (LLMs) have become a critical component in many applications of machine learning. However, standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters, each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging in which the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo is robust to the data distribution of each worker. It is also robust to resources becoming unavailable over time and, conversely, it can seamlessly leverage resources that become available during training.
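The two-level structure described above (many local AdamW steps per worker, followed by one Nesterov-momentum update on the averaged parameter change) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the toy quadratic loss per worker, the names `adamw_step` and `diloco`, the hyperparameter values, the choice to keep each worker's AdamW state local across rounds, and setting weight decay to zero so the toy optimum is simply the mean of the worker targets.

```python
import math

def adamw_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """Apply one in-place AdamW update (decoupled weight decay) to `theta`."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grad):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - b2 ** t)  # bias-corrected second moment
        theta[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta[i])

def diloco(targets, theta0, rounds=20, inner_steps=100, outer_lr=0.7, mu=0.9):
    """Toy DiLoCo-style loop. Worker k minimizes 0.5 * ||theta - targets[k]||^2,
    a stand-in for training on its own (non-identical) data shard."""
    dim = len(theta0)
    theta = list(theta0)   # shared (global) parameters
    buf = [0.0] * dim      # outer momentum buffer
    # Assumption of this sketch: each worker keeps its AdamW state across rounds.
    states = [{"t": 0, "m": [0.0] * dim, "v": [0.0] * dim} for _ in targets]
    for _round in range(rounds):
        replicas = []
        for target, state in zip(targets, states):
            local = list(theta)  # each worker starts from the shared parameters
            for _step in range(inner_steps):  # no communication in this loop
                grad = [x - y for x, y in zip(local, target)]
                adamw_step(local, grad, state)
            replicas.append(local)
        # The only communication per round: average the replicas and treat
        # (shared - average) as an "outer gradient" for Nesterov momentum.
        for i in range(dim):
            avg = sum(r[i] for r in replicas) / len(replicas)
            delta = theta[i] - avg
            buf[i] = mu * buf[i] + delta
            theta[i] -= outer_lr * (delta + mu * buf[i])
    return theta, rounds  # `rounds` equals the number of communication rounds

# Hypothetical toy setup: four workers with different targets.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 2.0]]
theta0 = [4.0, -4.0]
theta, comm_rounds = diloco(targets, theta0)
optimum = [sum(t[i] for t in targets) / len(targets) for i in range(2)]
print("final parameters:", theta, "optimum:", optimum)
print("communication rounds:", comm_rounds)
```

In this sketch, a fully synchronous baseline would communicate at every one of the `rounds * inner_steps` steps, so the communication saving equals `inner_steps`. Under that reading, a 500-fold reduction would correspond to 500 inner steps per round, although the abstract itself does not state the value used.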