Gradient normalization is central to deep-learning optimization because it stabilizes training and reduces sensitivity to scale. For deep architectures, parameters are naturally grouped into matrices or blocks, so spectral normalizations are often more faithful than coordinatewise Euclidean ones; Muon is the main motivating example of this paper. More broadly, we study a family of spectral normalization rules, ranging from ordinary gradient descent to Muon and intermediate Schatten-type schemes, in a mean-field regime where parameters are modeled by probability measures. We introduce a family of Spectral Wasserstein distances indexed by a norm γ on positive semidefinite matrices. The trace norm recovers the classical quadratic Wasserstein distance W2, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between the two. We develop the static Kantorovich formulation, prove comparison bounds with W2, derive a max-min representation, and obtain a conditional Brenier theorem. For Gaussian marginals, the problem reduces to a constrained optimization over covariance matrices, extending the Bures formula and yielding a closed form for commuting covariances in the Schatten family. For monotone norms, including all Schatten cases, we prove the equivalence of the static and dynamic Benamou-Brenier formulations, deduce that the resulting transport cost is a genuine metric equivalent to W2 in fixed dimension, and show that the induced Gaussian covariance cost is also a metric. We then interpret the associated normalized continuity equation as a Spectral Wasserstein gradient flow, identify its exact finite-particle counterpart as a normalized matrix flow, obtain initial geodesic-convexity results, and show how positively homogeneous mean-field models induce a spectral unbalanced transport on the sphere.
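To make the contrast between Euclidean and spectral normalization concrete, the following is a minimal sketch on a single gradient matrix, not the paper's construction. It assumes the idealized Muon direction is the polar factor of the gradient (all singular values set to one), computed here by an exact SVD. Practical Muon implementations approximate this factor iteratively and apply it to a momentum buffer. The function names `frobenius_normalize` and `spectral_normalize` are hypothetical and chosen only for illustration.

```python
import numpy as np


def frobenius_normalize(grad, eps=1e-12):
    """Euclidean normalization: rescale the matrix to unit Frobenius norm."""
    return grad / (np.linalg.norm(grad) + eps)


def spectral_normalize(grad):
    """Idealized Muon-style direction (assumption): the polar factor U @ Vt,
    i.e. the gradient with every singular value replaced by 1."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))  # stand-in for one weight-matrix gradient

d_frob = frobenius_normalize(grad)
d_spec = spectral_normalize(grad)

# The Frobenius rule fixes only the overall scale of the matrix.
print("Frobenius norm of d_frob:", np.linalg.norm(d_frob))
# The spectral rule equalizes every singular direction.
print("singular values of d_spec:", np.linalg.svd(d_spec, compute_uv=False))
```

Both rules are invariant to rescaling the gradient, but only the spectral rule also flattens the spectrum. This is the sense in which the operator-norm geometry differs from the trace-norm (W2) geometry described in the abstract.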