Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

An emerging lesson from neural network optimization is that optimizer design should respect how the model is parametrized. The layerwise input-output structure of neural networks motivates scale-invariant optimizers such as Muon and Scion, whose updates also support hyperparameter transfer. At the same time, stochastic gradient noise in deep learning is often far from sub-Gaussian and may exhibit heavy tails. These observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences are underexplored. In particular, it remains unclear what dimension dependence is unavoidable for gradient-based methods when the problem class is defined by an input-output norm and the noise is heavy-tailed, and whether higher-order smoothness can accelerate training. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ equipped with general norms and under $p^{\mathrm{th}}$-moment heavy-tailed noise, where the goal is to find an $\epsilon$-stationary point in the dual norm. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any gradient-based method requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls for the problem class defined by the spectral norm, which is a common input-output norm. We prove that a scale-invariant Scion method with the spectral norm achieves the matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$ when the Hessian is Lipschitz. Finally, we incorporate heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility with neural network training.
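To make the two rates above concrete, the following minimal sketch evaluates the exponents of $1/\epsilon$ for a few moment orders $p$. The function names and the sample values of $p$ are illustrative assumptions, not taken from the paper; the only requirement imposed here is $p > 1$, which both formulas need in order to be defined (the range $1 < p \le 2$ is customary in the heavy-tailed literature, but that restriction is an assumption and is not enforced).

```python
from fractions import Fraction


def first_order_exponent(p):
    """Exponent a in eps^{-a} for the matching bounds: (3p - 2) / (p - 1)."""
    p = Fraction(p)
    if p <= 1:
        raise ValueError("the formula requires p > 1")
    return (3 * p - 2) / (p - 1)


def lipschitz_hessian_exponent(p):
    """Exponent a in eps^{-a} under a Lipschitz Hessian: (5p - 3) / (2p - 2)."""
    p = Fraction(p)
    if p <= 1:
        raise ValueError("the formula requires p > 1")
    return (5 * p - 3) / (2 * p - 2)


# Hypothetical sample moment orders, chosen only for illustration.
for p in (Fraction(5, 4), Fraction(3, 2), Fraction(2)):
    a1 = first_order_exponent(p)
    a2 = lipschitz_hessian_exponent(p)
    # The difference simplifies algebraically to (p - 1) / (2p - 2) = 1/2.
    print(f"p = {p}: first-order {a1}, Lipschitz Hessian {a2}, gap {a1 - a2}")
```

Because $\frac{3p-2}{p-1} - \frac{5p-3}{2p-2} = \frac{1}{2}$ for every $p > 1$, the stated improvement from Hessian Lipschitzness amounts to a factor of $\epsilon^{-1/2}$ regardless of the tail index. At $p = 2$ the two exponents are $4$ and $7/2$.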
