A variety of widely used optimization methods, such as SignSGD and Muon, can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and the convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, convergence and successful classification can be guaranteed in the worst case only with full-batch gradients. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. Our analysis yields fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we investigate batch-size-one steepest descent without momentum and show, via a concrete data example, that it converges to a fundamentally different bias, which exposes a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper exploration of the training behavior of stochastic steepest descent algorithms.
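To make the steepest-descent view concrete, the following minimal sketch computes the normalized steepest-descent direction of a gradient matrix under three norms: the entry-wise $\ell_\infty$ norm (the sign update used by SignSGD), the Frobenius norm (normalized gradient descent), and the spectral norm, i.e. Schatten-$\infty$ (the orthogonalized update associated with Muon). The function names `steepest_direction` and `momentum_step`, the exponential-averaging form of the momentum, and the parameters `lr` and `beta` are illustrative assumptions, not the algorithm or notation analyzed in the paper.

```python
import numpy as np

def steepest_direction(grad, geometry):
    """Return argmax over ||D|| <= 1 of <grad, D> for the chosen norm.

    Hypothetical helper for illustration only.
    """
    if geometry == "linf":
        # Entry-wise l_inf ball: the maximizer is the sign pattern (SignSGD).
        return np.sign(grad)
    if geometry == "frobenius":
        # Frobenius ball: the maximizer is the normalized gradient.
        norm = np.linalg.norm(grad)
        return grad / norm if norm > 0 else np.zeros_like(grad)
    if geometry == "spectral":
        # Spectral-norm ball: the maximizer is U V^T from the reduced SVD.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown geometry: {geometry}")

def momentum_step(weights, momentum, batch_grad, geometry, lr=0.1, beta=0.9):
    """One stochastic steepest-descent step with momentum (assumed form)."""
    # Exponential moving average of mini-batch gradients.
    momentum = beta * momentum + (1.0 - beta) * batch_grad
    # Move along the steepest-descent direction of the averaged gradient.
    weights = weights - lr * steepest_direction(momentum, geometry)
    return weights, momentum

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((3, 4))
    for geometry in ("linf", "frobenius", "spectral"):
        direction = steepest_direction(grad, geometry)
        # The inner product equals the dual norm of the gradient.
        print(geometry, float(np.sum(grad * direction)))
```

In each case the inner product between the gradient and the returned direction equals the dual norm of the gradient (the entry-wise $\ell_1$ norm, the Frobenius norm, and the nuclear norm, respectively), which is the defining property of a steepest-descent direction on the unit ball.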