OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework built on the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries while maintaining 90-95% compute utilization. Additionally, we conduct ablation studies on the algorithm's compute efficiency and its scalability in the number of workers, and show that its gradients can be all-reduced in FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.
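To make the FP16 all-reduce claim concrete, below is a minimal PyTorch sketch of a DiLoCo-style outer step: each worker computes a pseudo-gradient (the drift of its local replica from the last globally synchronized weights), averages it across workers in FP16, and applies it with an outer optimizer. The names `outer_step`, `outer_params`, and `outer_opt` are illustrative assumptions, not the OpenDiLoCo API, and the sketch uses `torch.distributed` rather than Hivemind for brevity.

```python
import torch
import torch.distributed as dist

def outer_step(model, outer_params, outer_opt):
    """One DiLoCo-style outer step (illustrative sketch, not the
    OpenDiLoCo implementation): all-reduce pseudo-gradients in FP16,
    then update the global weights with the outer optimizer."""
    for p, gp in zip(model.parameters(), outer_params):
        # Pseudo-gradient: how far the local replica has drifted from
        # the last globally synchronized weights after the inner steps.
        pseudo_grad = gp.data - p.data
        # Average across workers in FP16 -- the abstract reports no
        # performance degradation from this compression.
        buf = pseudo_grad.to(torch.float16)
        dist.all_reduce(buf, op=dist.ReduceOp.AVG)
        gp.grad = buf.to(gp.dtype)
    outer_opt.step()   # e.g. SGD with Nesterov momentum, as in DiLoCo
    outer_opt.zero_grad()
    # Reset each local replica to the new global weights before the
    # next round of inner steps.
    for p, gp in zip(model.parameters(), outer_params):
        p.data.copy_(gp.data)
```

In this sketch the inner loop (many local optimizer steps per worker) runs between calls to `outer_step`, so communication happens only once every few hundred inner steps, which is what keeps compute utilization high across geographically distributed workers.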