MuLoCo: Muon is a practical inner optimizer for DiLoCo

DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and higher accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers (K) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW in data-parallel (DP) training, we examine how Muon's normalized optimizer steps affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as K increases. In our language model pre-training experiments, we conduct extensive hyperparameter tuning of DiLoCo, MuLoCo, AdamW DP, and Muon DP across 150M, 416M, 914M, 1.76B, and 3.1B parameter models. Consistently across all scales, we find that with K≥1 workers, MuLoCo (DiLoCo with a Muon inner optimizer) outperforms DiLoCo in absolute terms, and for K>2 it also outperforms DiLoCo relative to their respective data-parallel baselines, while remaining compatible with quantization, streaming, and long synchronization intervals. At K=1, MuLoCo can even outperform the data-parallel gold standard while having larger critical batch sizes. Finally, we extrapolate optimal hyperparameters to the 15B scale and train a model with each method (six in total) using K=1 and K=16 workers. At this scale, K=16 MuLoCo nearly matches single-worker performance, and K=1 MuLoCo matches the best-performing baseline while using a much larger 16M-token batch size.
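
To make the term "pseudogradient" concrete, here is a minimal sketch of one DiLoCo-style communication round on a toy quadratic problem. It is an illustration under stated assumptions, not the paper's implementation: the inner optimizer is plain SGD standing in for AdamW (DiLoCo) or Muon (MuLoCo), the outer optimizer is assumed to be SGD with Nesterov momentum, and every function name, hyperparameter value, and the toy objective are hypothetical choices made for this example.

```python
import numpy as np

def inner_sgd_steps(theta, grad_fn, steps, lr):
    """Stand-in inner optimizer (plain SGD). This is an assumption for
    illustration; DiLoCo would use AdamW here and MuLoCo would use Muon."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta

def diloco_round(theta_global, momentum, worker_grad_fns,
                 inner_steps=10, inner_lr=0.1, outer_lr=0.7, outer_beta=0.9):
    """One communication round: local training, then a single outer step."""
    # Every worker starts from the shared parameters and trains locally.
    local_params = [
        inner_sgd_steps(theta_global, grad_fn, inner_steps, inner_lr)
        for grad_fn in worker_grad_fns
    ]
    # Pseudogradient: shared parameters minus the average of the workers'
    # parameters, signed so the outer optimizer can treat it like a gradient.
    pseudograd = theta_global - np.mean(local_params, axis=0)
    # Outer step: SGD with Nesterov momentum (assumed outer optimizer).
    momentum = outer_beta * momentum + pseudograd
    theta_global = theta_global - outer_lr * (pseudograd + outer_beta * momentum)
    return theta_global, momentum, pseudograd

# Toy setup: worker k minimizes 0.5 * ||theta - target_k||^2, so its gradient
# is (theta - target_k). The per-worker targets mimic heterogeneous data shards.
rng = np.random.default_rng(0)
num_workers = 4  # K
targets = np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal((num_workers, 3))
worker_grad_fns = [lambda th, t=t: th - t for t in targets]

theta = np.zeros(3)
momentum = np.zeros(3)
for _ in range(50):
    theta, momentum, pseudograd = diloco_round(theta, momentum, worker_grad_fns)

# Distance between the shared parameters and the average worker optimum.
print(np.linalg.norm(theta - targets.mean(axis=0)))
```

In this sketch the inner optimizer only influences the result through `local_params`, which is the sense in which the choice of inner optimizer shapes the pseudogradient that the outer optimizer consumes.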
