Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce {\textbf{De}}coupled {\textbf{Mo}}mentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent, and it supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models. An open-source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo
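To make the abstract's core idea concrete, the following is a minimal NumPy sketch of a decoupled-momentum step in the spirit described above: each worker accumulates gradients into a local momentum buffer, extracts only the highest-energy frequency components (here via an orthonormal DCT-II) for communication, and keeps the residual momentum local, allowing controlled divergence across workers. The function and parameter names, the per-vector DCT, and the top-k selection rule are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are frequencies, columns are samples.
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def demo_step(momentum, grad, beta=0.9, k=4):
    """One decoupled-momentum update for a single parameter vector (sketch).

    Accumulates the gradient into the local momentum, extracts the k
    highest-energy DCT components (the "fast-moving" part to be shared
    across accelerators), and removes them from the local momentum so
    the slow residual is never communicated.
    """
    momentum = beta * momentum + grad
    basis = dct_matrix(momentum.size)
    coeffs = basis @ momentum
    top = np.argsort(np.abs(coeffs))[-k:]      # energy compaction: keep top-k
    fast = np.zeros_like(coeffs)
    fast[top] = coeffs[top]
    extracted = basis.T @ fast                 # back to parameter space
    momentum = momentum - extracted            # residual diverges locally
    return momentum, extracted
```

Only `extracted` (k coefficients per vector, rather than the full gradient) would need to be synchronized across workers, which is the source of the communication reduction claimed above; the actual DeMo algorithm operates on chunked parameter tensors and applies a sign-based update to the gathered components.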