As machine learning models grow ever larger, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work has developed an approach, DiLoCo, that relaxes these synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better with model size than data-parallel training, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
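To make the relaxed-synchronization idea concrete, the following is a minimal sketch of a DiLoCo-style training loop, not the authors' implementation: each of M replicas takes H local optimizer steps on its own data shard between synchronizations, and a single outer update is applied to the shared parameters per round. The inner AdamW / outer Nesterov-momentum pairing follows the published DiLoCo recipe; the model, the `shards[m].next_batch()` data interface, and all hyperparameter values here are illustrative placeholders.

```python
# Hypothetical DiLoCo-style loop: H local steps per replica, one outer sync per round.
import copy
import torch

def diloco_train(global_model, shards, M=4, H=100, outer_lr=0.7, rounds=10):
    # Outer optimizer (Nesterov momentum SGD) acts on the shared global parameters.
    outer_opt = torch.optim.SGD(global_model.parameters(),
                                lr=outer_lr, momentum=0.9, nesterov=True)
    for _ in range(rounds):
        deltas = [torch.zeros_like(p) for p in global_model.parameters()]
        for m in range(M):
            replica = copy.deepcopy(global_model)  # each replica starts from the global params
            inner_opt = torch.optim.AdamW(replica.parameters(), lr=1e-3)
            for _ in range(H):  # H inner steps with no cross-replica communication
                x, y = shards[m].next_batch()      # placeholder data interface
                loss = torch.nn.functional.cross_entropy(replica(x), y)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            # Accumulate the averaged parameter delta ("outer gradient").
            for d, p_new, p_old in zip(deltas, replica.parameters(),
                                       global_model.parameters()):
                d += (p_old.data - p_new.data) / M
        # One outer update per round, using the averaged delta as the gradient.
        outer_opt.zero_grad()
        for p, d in zip(global_model.parameters(), deltas):
            p.grad = d
        outer_opt.step()
    return global_model
```

In this sketch, communication happens only once every H steps rather than every step, which is the synchronization relaxation the abstract refers to; the scaling-law analysis then asks how quantities such as M, the token budget, and hyperparameters interact with model size.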