We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum-based steepest descent algorithms such as Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) follow approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms are likewise biased towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize the margin with respect to a hybrid norm. Our experiments corroborate the theory and show that which margin is maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and on momentum-based optimizers in linear models.
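For concreteness, a minimal sketch of the normalized steepest descent update underlying this correspondence (the notation here is illustrative and not the paper's exact formulation: $L$ denotes the training loss, $\eta_t$ the learning rate schedule, and $\|\cdot\|$ the norm associated with the optimizer):
\[
\theta_{t+1} \;=\; \theta_t - \eta_t\,\Delta_t,
\qquad
\Delta_t \in \arg\max_{\|\Delta\| \le 1} \,\langle \nabla L(\theta_t),\, \Delta \rangle,
\]
so that the $\ell_2$ norm recovers normalized gradient descent, the $\ell_\infty$ norm gives a sign-of-gradient update as in Signum, and the spectral norm gives an orthogonalized update as in Muon; the momentum variants replace $\nabla L(\theta_t)$ with a momentum buffer.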