Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, improving upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
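The combination described above can be sketched roughly as follows. This is a hedged illustration under simplifying assumptions, not the released implementation (see the linked repository): the cubic Newton–Schulz iteration (Muon's reference code uses a tuned quintic variant), the momentum form, and all hyperparameter values (`lr`, `beta`, `gamma`) are assumptions made for exposition.

```python
# Minimal sketch (assumption, not the authors' code): a MARS-style
# variance-reduced gradient fed into a Muon-style orthogonalized update.
import numpy as np

def newton_schulz_orthogonalize(M, steps=30):
    """Approximate the orthogonal polar factor of M with a cubic
    Newton-Schulz iteration. Normalizing by the Frobenius norm bounds
    all singular values by 1, keeping the iteration stable."""
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def mars_m_step(W, m, g, g_prev, lr=0.02, beta=0.95, gamma=0.025):
    """One hypothetical MARS-M step on a weight matrix W:
    MARS-style gradient correction, momentum, then Muon-style
    orthogonalization of the update direction."""
    # MARS variance reduction: correct g using the previous gradient.
    c = g + gamma * (beta / (1.0 - beta)) * (g - g_prev)
    # Clip the corrected gradient to unit Frobenius norm.
    n = np.linalg.norm(c)
    if n > 1.0:
        c = c / n
    m = beta * m + (1.0 - beta) * c     # momentum on the corrected gradient
    O = newton_schulz_orthogonalize(m)  # Muon-style matrix preconditioning
    return W - lr * O, m
```

In this sketch the variance-reduced gradient `c`, rather than the raw stochastic gradient, drives the momentum buffer that Muon orthogonalizes; the scalar hyperparameters shown are illustrative placeholders.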