The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

A variety of widely used optimization methods, such as SignSGD and Muon, can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, worst-case convergence and successful classification can only be guaranteed with full-batch gradients. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we investigate batch-size-one steepest descent without momentum and show, via a concrete data example, that it converges to a fundamentally different bias, which exposes a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic steepest descent algorithms.
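To make the "norm-induced geometry" concrete, the following is a minimal sketch, not the paper's algorithm, of the steepest descent direction under three entry-wise norms, combined with a momentum buffer fed by mini-batch gradients. The function names, the exponential-moving-average form of the momentum, and the fixed learning rate are all assumptions made for illustration. Under the $\ell_\infty$ norm the direction is the coordinate-wise sign, which is the SignSGD-style update; the matrix (Schatten-$p$) case, where the spectral norm yields the Muon-style update, is not covered by this sketch.

```python
import math


def steepest_direction(grad, norm):
    """Return a unit-norm direction d maximizing <grad, d> under the given norm.

    Hypothetical helper: `norm` is one of "linf", "l2", "l1".
    """
    if norm == "linf":
        # l_inf geometry: every coordinate moves by its sign (SignSGD-style).
        return [(g > 0) - (g < 0) for g in grad]
    if norm == "l2":
        # l_2 geometry: the normalized gradient.
        scale = math.sqrt(sum(g * g for g in grad))
        return [g / scale if scale > 0 else 0.0 for g in grad]
    if norm == "l1":
        # l_1 geometry: move only the coordinate with the largest |gradient|.
        i = max(range(len(grad)), key=lambda k: abs(grad[k]))
        return [
            math.copysign(1.0, grad[k]) if k == i and grad[k] != 0 else 0.0
            for k in range(len(grad))
        ]
    raise ValueError(f"unknown norm: {norm}")


def momentum_step(w, m, batch_grad, lr, beta, norm):
    """One update: refresh the momentum buffer, then step along its steepest direction.

    Assumption: momentum is an exponential moving average of mini-batch gradients.
    """
    m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, batch_grad)]
    d = steepest_direction(m, norm)
    w = [wi - lr * di for wi, di in zip(w, d)]
    return w, m
```

With `beta = 0` the buffer equals the current mini-batch gradient, which corresponds to the momentum-free setting discussed in the abstract; larger `beta` averages over more past batches.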

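Steepest descent

The steepest descent method, also known as the gradient method, was introduced by Cauchy in 1847. It is the oldest of the analytical methods; the other analytical methods are either variants of it or were inspired by it, which makes it a foundation of optimization. Its advantages are a low per-iteration workload, few stored variables, and weak requirements on the initial point. Its drawbacks are slow convergence, low efficiency, and occasional failure to reach the optimum. Nonlinear programming studies the numerical optimization of nonlinear functions, and its theory and methods have important applications in military affairs, economics, management, production automation, engineering design, and product optimization. Steepest descent is an important analytical method for the unconstrained nonlinear programming problem $\min f(x)$ over $n$ variables, so studying its principle and implementation is of considerable value.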
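As an illustration of the classical method described above, here is a minimal sketch of steepest descent with exact line search on a quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, assuming $A$ is symmetric positive definite. The function name, tolerance, and the 2x2 test problem are illustrative choices, not taken from the source. For this objective the gradient is $g = Ax - b$, and the exact step size along $-g$ is $\alpha = g^\top g / (g^\top A g)$.

```python
def steepest_descent_quadratic(A, b, x, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 0.5 * x^T A x - b^T x for symmetric positive definite A."""
    n = len(b)
    for _ in range(max_iter):
        # Gradient of the quadratic: g = A x - b.
        g = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        gg = sum(gi * gi for gi in g)
        if gg < tol * tol:
            # Gradient norm is below the tolerance: stop.
            break
        Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Exact line search along -g: alpha = (g . g) / (g . A g).
        alpha = gg / sum(gi * agi for gi, agi in zip(g, Ag))
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


# Illustrative problem: the minimizer solves A x = b, here x = (0.2, 0.4).
x_star = steepest_descent_quadratic([[3.0, 1.0], [1.0, 2.0]], [1.0, 1.0], [0.0, 0.0])
```

The slow convergence mentioned above appears when $A$ is ill-conditioned: the iterates zigzag, and the number of iterations grows with the ratio of the largest to the smallest eigenvalue.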
