Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
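To make the delay-thresholding mechanism concrete, the following is a minimal sketch of a single server-side update, not the algorithm as specified in the paper. It rests on several assumptions made purely for illustration: the names `lmo_euclidean` and `server_step`, the parameters `stepsize`, `beta`, and `threshold`, the exponential-averaging form of the momentum, and the exact discard rule (`delay >= threshold`). For simplicity it uses the LMO over the unit Euclidean ball, which yields a normalized momentum step; Muon instead corresponds to the LMO over a spectral-norm ball, which is not shown here.

```python
import math


def lmo_euclidean(m):
    """LMO over the unit Euclidean ball: argmin_{||d|| <= 1} <m, d> = -m / ||m||."""
    norm = math.sqrt(sum(v * v for v in m))
    if norm == 0.0:
        # Any feasible point minimizes <0, d>; return the zero direction.
        return [0.0] * len(m)
    return [-v / norm for v in m]


def server_step(x, m, grad, delay, threshold, stepsize=0.1, beta=0.5):
    """One hypothetical server update with delay thresholding.

    `delay` is the number of server updates applied since the worker read
    the iterate at which `grad` was computed. Returns (x, m, accepted).
    """
    # Assumed discard rule: a gradient that is too stale is ignored entirely.
    if delay >= threshold:
        return x, m, False
    # Assumed momentum form: exponential average of accepted gradients.
    m = [(1.0 - beta) * mi + beta * gi for mi, gi in zip(m, grad)]
    # Move along the LMO direction of the momentum, not along the raw gradient.
    d = lmo_euclidean(m)
    x = [xi + stepsize * di for xi, di in zip(x, d)]
    return x, m, True


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
    x, m = [3.0, 4.0], [0.0, 0.0]
    x, m, accepted = server_step(x, m, grad=x, delay=0, threshold=2)
    print(accepted, x)
    # A gradient with delay 5 exceeds the threshold and changes nothing.
    x, m, accepted = server_step(x, m, grad=[100.0, 100.0], delay=5, threshold=2)
    print(accepted, x)
```

In this sketch the step length is set by `stepsize` alone, because the LMO output has unit norm whenever the momentum is nonzero. The threshold therefore limits how stale the gradients entering the momentum may be, rather than how large a single step can become.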