Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

We develop a gradient flow on the space of probability measures over matrix-valued parameters, induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth potential obtained by Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(\rho)=R\left(\int F \, d\rho\right)$, a setting motivated by mean-field descriptions of neural-network training. We then derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow is a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. The target objective itself need not be monotone along the inertial Muon dynamics; nevertheless, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain exponential convergence rates for the objective gap in both continuous and discrete time. We also study the well-posedness of the mean-field limit equation and establish propagation-of-chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.
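
To make the key observation concrete, the following is a sketch of one standard smoothing for which the gradient identity can be checked by hand. The potential $\phi_\varepsilon$, the parameter $\varepsilon>0$, and the shape convention $M\in\mathbb{R}^{m\times n}$ with $m\ge n$ are assumptions introduced here for illustration; the regularization used in the paper may differ. With the thin singular value decomposition $M = U\,\mathrm{diag}(\sigma_1,\dots,\sigma_n)\,V^\top$:

$$
\phi_\varepsilon(M) \;=\; \mathrm{tr}\!\left[(M^\top M+\varepsilon I)^{1/2}\right] \;=\; \sum_{i=1}^{n}\sqrt{\sigma_i^2+\varepsilon},
\qquad
\nabla\phi_\varepsilon(M) \;=\; M\,(M^\top M+\varepsilon I)^{-1/2} \;=\; U\,\mathrm{diag}\!\left(\frac{\sigma_i}{\sqrt{\sigma_i^2+\varepsilon}}\right)V^\top .
$$

As $\varepsilon\to 0$, $\phi_\varepsilon(M)$ tends to the nuclear norm $\sum_i \sigma_i$, and for full-rank $M$ the gradient tends to $UV^\top$, the idealized orthogonalization. For $\varepsilon>0$ the singular values of $\nabla\phi_\varepsilon(M)$ lie in $[0,1)$, so under this assumed smoothing the update remains bounded in spectral norm while depending smoothly on $M$.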
