The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization independently in each layer. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We establish an improved convergence guarantee for TEON over layer-wise Muon and, guided by this theoretical analysis, develop a practical instantiation of TEON together with corresponding ablation studies. We evaluate our approach on two widely adopted architectures: GPT-style models ranging from 130M to 774M parameters, and LLaMA-style models ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.
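To make the layer-wise orthogonalization that Muon performs (and that TEON generalizes) concrete, the sketch below illustrates the quintic Newton-Schulz iteration used in public Muon reference implementations to approximate the orthogonal polar factor UV^T of one layer's momentum matrix. This is an illustrative NumPy sketch, not TEON's algorithm: the coefficients and 5-step default are taken from the open-source Muon code, and TEON's joint, tensor-level orthogonalization is not reproduced here since the abstract does not specify it.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the orthogonal polar factor U V^T of G (from the
    SVD G = U S V^T) with a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Work in the "wide" orientation so A = X X^T is the smaller Gram matrix.
    transposed = G.shape[0] > G.shape[1]
    X = G.T if transposed else G
    # Normalizing by the Frobenius norm bounds the spectral norm by 1,
    # which is required for the iteration to behave well.
    X = X / (np.linalg.norm(X) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

# Layer-wise Muon-style update: each layer's momentum matrix is
# orthogonalized independently of all other layers.
rng = np.random.default_rng(0)
M = rng.standard_normal((256, 512))      # momentum buffer of one layer
O = newton_schulz_orthogonalize(M)
# Singular values of O are pushed toward 1 (only approximately so:
# the truncated iteration trades exactness for speed).
print(np.round(np.linalg.svd(O, compute_uv=False)[:4], 3))
```

Layer-wise application treats each weight matrix in isolation; TEON instead models the collection of gradients as a structured higher-order tensor, so the orthogonalization couples information across layers rather than operating on each matrix independently.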