We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms such as Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms, too, are biased toward KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid-norm margin. Our experiments corroborate the theory and show that which margin is maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and on momentum-based optimizers in linear models.
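To make the norm dependence concrete, the following is a minimal illustrative sketch (not the paper's exact algorithms, and omitting momentum and the learning rate schedule) of how the normalized steepest-descent direction for a gradient matrix $G$ changes with the chosen norm: the $\ell_2$ norm gives the normalized gradient (as in MomentumGD), the $\ell_\infty$ norm gives the entrywise sign (as in Signum), and the spectral norm gives the orthogonalized factor $UV^\top$ from the SVD of $G$ (as in Muon). The function name `steepest_descent_direction` is our own for illustration.

```python
import numpy as np

def steepest_descent_direction(G, norm):
    """Normalized steepest-descent direction for gradient G under a given norm.

    "l2"       -> G / ||G||_2 (Frobenius-normalized gradient; MomentumGD-style)
    "linf"     -> sign(G) entrywise (Signum-style)
    "spectral" -> U @ V^T from the reduced SVD of G (Muon-style)
    """
    if norm == "l2":
        return G / np.linalg.norm(G)
    if norm == "linf":
        return np.sign(G)
    if norm == "spectral":
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return U @ Vt
    raise ValueError(f"unknown norm: {norm}")

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))  # a stand-in gradient matrix

d_l2 = steepest_descent_direction(G, "l2")        # unit Frobenius norm
d_linf = steepest_descent_direction(G, "linf")    # entries in {-1, +1}
d_spec = steepest_descent_direction(G, "spectral")  # all singular values equal 1
```

Each direction has unit length in its own dual sense: the $\ell_2$ direction has unit Frobenius norm, the sign direction has unit $\ell_\infty$ norm entrywise, and the orthogonalized direction has all singular values equal to one.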