The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which exploit the matrix structure of gradients, have shown promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) and those that exploit their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms to improve convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
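To make the described update rule concrete, the following is a minimal sketch of a PolarGrad-style step on a single matrix parameter, assuming an SVD-based polar decomposition for clarity (the paper advocates faster numerical polar decomposition algorithms). The function names `polar_factor_svd` and `polargrad_step`, and the normalization of the nuclear norm by `min(m, n)`, are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def polar_factor_svd(G: torch.Tensor):
    """Polar decomposition G = U_p H via SVD: the orthogonal polar factor is
    U_p = U Vh, and the nuclear norm ||G||_* is the sum of singular values."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh, S.sum()


def polargrad_step(W: torch.Tensor, G: torch.Tensor, lr: float = 1e-3):
    """One hypothetical PolarGrad-style update on a matrix parameter W:
    step along the orthogonal polar factor of the gradient G, scaled by
    the gradient's nuclear norm (normalization choice is an assumption)."""
    U_p, nuc = polar_factor_svd(G)
    m, n = G.shape
    # Dividing the nuclear norm by min(m, n) keeps the step scale comparable
    # to a Muon-style orthogonalized update; the paper's exact scaling may differ.
    W -= lr * (nuc / min(m, n)) * U_p
    return W


# Example usage on a random matrix parameter and gradient.
W = torch.randn(256, 512)
G = torch.randn(256, 512)
polargrad_step(W, G, lr=0.02)
```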