The rapid development of large language models (LLMs) has driven demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner-optimizer steps on the fast weights trace out a trajectory whose net displacement, the pseudo-gradient, is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, and has been reported to outperform AdamW even in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute-factor gains of 1.5--2.5$\times$ in a non-distributed setting up to a scale of $10^{23}$ training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and its compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
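The outer update described above can be sketched in a few lines of PyTorch-style pseudocode. This is a minimal illustrative sketch, not the paper's implementation: the function name `snoo_train`, the hyperparameter values, and the data-iterator interface are assumptions, and the inner optimizer is left generic (e.g. AdamW or Muon).

```python
import torch

def snoo_train(model, inner_opt, loss_fn, data_iter,
               K=50, outer_lr=0.7, outer_momentum=0.9, outer_rounds=100):
    """Hypothetical sketch of a Step-K Nesterov Outer Optimizer (SNOO) loop.

    Fast weights live in `model`; slow weights are detached copies.
    Every K inner steps, the pseudo-gradient (slow - fast) is fed to a
    Nesterov-momentum outer step on the slow weights, which are then
    copied back into the model.
    """
    # Slow weights and their Nesterov-momentum outer optimizer.
    slow_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(slow_params, lr=outer_lr,
                                momentum=outer_momentum, nesterov=True)

    for _ in range(outer_rounds):
        # Inner loop: K ordinary optimizer steps on the fast weights.
        for _ in range(K):
            x, y = next(data_iter)
            inner_opt.zero_grad()
            loss_fn(model(x), y).backward()
            inner_opt.step()

        # Outer step: pseudo-gradient = slow - fast, applied with Nesterov momentum.
        with torch.no_grad():
            for slow, fast in zip(slow_params, model.parameters()):
                slow.grad = slow - fast
            outer_opt.step()
            # Reset the fast weights to the updated slow weights.
            for slow, fast in zip(slow_params, model.parameters()):
                fast.copy_(slow)
```

In this sketch the only extra state beyond the inner optimizer is one copy of the slow weights and one momentum buffer, which is the source of the minimal memory overhead noted above.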