The Muon optimizer has shown promising performance in pre-training large language models by orthogonalizing the gradient (or momentum). In this work, we propose Muon+, a simple yet effective enhancement to Muon that adds a normalization step after orthogonalization. We demonstrate its effectiveness through extensive pre-training experiments across a wide range of model scales and architectures, covering GPT-style models from 130M to 774M parameters and LLaMA-style models from 60M to 1B parameters. We evaluate Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ consistently improves training and validation perplexity over Muon. Our code is available at https://github.com/K1seki221/MuonPlus.
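To make the "orthogonalize, then normalize" idea concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. The orthogonalization uses the quintic Newton-Schulz iteration with the coefficients from the public Muon reference code. The abstract does not say which normalization Muon+ applies, so the row-wise L2 normalization here is purely an assumption for illustration, and the function names `newton_schulz_orthogonalize`, `normalize_rows`, and `muon_plus_direction` are hypothetical.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]


def frobenius_norm(A):
    return math.sqrt(sum(x * x for row in A for x in row))


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference code; the result has
    singular values close to 1 but is not exactly orthogonal.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    scale = frobenius_norm(G) + eps
    X = [[x / scale for x in row] for row in G]  # spectral norm now <= 1
    tall = len(X) > len(X[0])
    if tall:  # iterate on the wide orientation so X X^T is the smaller Gram matrix
        X = transpose(X)
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(ra, raa)] for ra, raa in zip(A, AA)]
        BX = matmul(B, X)
        X = [[a * p + q for p, q in zip(rx, rbx)] for rx, rbx in zip(X, BX)]
    return transpose(X) if tall else X


def normalize_rows(O, eps=1e-7):
    # ASSUMPTION: row-wise L2 normalization stands in for the unspecified
    # "additional normalization step"; the actual Muon+ choice may differ.
    out = []
    for row in O:
        n = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / n for x in row])
    return out


def muon_plus_direction(momentum):
    """Hypothetical update direction: orthogonalize, then normalize."""
    return normalize_rows(newton_schulz_orthogonalize(momentum))


if __name__ == "__main__":
    M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.5]]  # toy momentum matrix
    for row in muon_plus_direction(M):
        print([round(x, 4) for x in row])
```

In an actual optimizer this direction would be multiplied by the learning rate and applied to each 2D weight matrix; the repository linked in the abstract is the authoritative source for the normalization Muon+ really uses.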