Spectral gradient methods, such as the Muon optimizer, modify gradient updates by preserving directional information while discarding scale, and have shown strong empirical performance in deep learning. We investigate the mechanisms underlying these gains through a dynamical analysis of a nonlinear phase retrieval model with anisotropic Gaussian inputs, which is equivalent to training a two-layer neural network with quadratic activation and fixed second-layer weights. Focusing on a spiked covariance setting in which the dominant variance direction is orthogonal to the signal, we show that gradient descent (GD) suffers from variance-induced misalignment: during the early escaping stage, the high-variance but uninformative spike direction is multiplicatively amplified, degrading alignment with the true signal under strong anisotropy. In contrast, spectral gradient descent (SpecGD) removes this spike amplification, yielding stable alignment and accelerated noise contraction. Numerical experiments confirm the theory and show that these phenomena persist under broader anisotropic covariances.
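A minimal sketch of the setting described above is given below; it is not the paper's exact experimental protocol. All concrete choices (dimensions, spike variance, the mean-squared loss on quadratic-activation outputs, step sizes, and the Muon-style orthogonalization that replaces the gradient's singular values with ones) are illustrative assumptions. It constructs a spiked covariance whose dominant variance direction e1 is orthogonal to the true signal w* along e2, then compares plain GD against SpecGD on the resulting phase retrieval problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 10, 2000          # input dim, hidden width, sample size (assumed)

# Spiked covariance: high variance along e1 (the uninformative "spike"),
# true signal along e2, so the dominant variance direction is orthogonal
# to the signal, as in the setting studied above.
spike_var = 25.0
scale = np.ones(d)
scale[0] = np.sqrt(spike_var)
w_star = np.zeros(d)
w_star[1] = 1.0

X = rng.standard_normal((n, d)) * scale   # anisotropic Gaussian inputs
y = (X @ w_star) ** 2                     # phase retrieval targets

def grad(W):
    """Gradient of L(W) = 1/(2n) * sum_i (f(x_i) - y_i)^2 for the
    quadratic-activation network f(x) = mean_j (w_j^T x)^2; W is (k, d)."""
    Z = X @ W.T                            # (n, k) pre-activations
    resid = (Z ** 2).mean(axis=1) - y      # (n,) residuals
    return (2.0 / (k * n)) * (Z * resid[:, None]).T @ X

def orthogonalize(G):
    """Muon-style spectral step: keep singular directions, drop scale (U V^T)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def alignment(W):
    """Largest |cosine similarity| between a row of W and the true signal."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.abs(Wn @ w_star).max()

W0 = 0.01 * rng.standard_normal((k, d))    # small random initialization
for name, direction, lr in [("GD", lambda G: G, 1e-3),
                            ("SpecGD", orthogonalize, 1e-2)]:
    W = W0.copy()
    for _ in range(500):
        W -= lr * direction(grad(W))
    print(f"{name:7s} alignment with w*: {alignment(W):.3f}")
```

Under this linearization, the early GD dynamics amplify rows of W along the spike e1 (whose effective growth rate scales with the spike variance) faster than along the signal e2, while the SpecGD step has unit singular values and so does not inherit that variance-driven amplification; step sizes and iteration counts here are untuned and may need adjustment.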