Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, which shapes both the optimization trajectory and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime whose batch-size-dependent behavior cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes. At small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD. At large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions, consistent with full-batch dynamics. We further show that these plateaus align with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
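As a rough illustration of the large-batch plateau, the sketch below checks the classical full-batch linear stability threshold $2(1+\beta)/\eta$ on a one-dimensional quadratic. It assumes the heavy-ball convention $x_{t+1} = x_t - \eta \lambda x_t + \beta (x_t - x_{t-1})$, which may differ from the exact parameterization used in the paper; the function names and the values of `lr` and `beta` are hypothetical. It does not model mini-batch noise, so it says nothing about how the lower plateau $2(1-\beta)/\eta$ arises and only evaluates that formula.

```python
import numpy as np


def heavy_ball_spectral_radius(curvature, lr, beta):
    """Spectral radius of one heavy-ball step on the quadratic 0.5 * curvature * x**2.

    Assumed update: x_{t+1} = x_t - lr * curvature * x_t + beta * (x_t - x_{t-1}).
    The iteration is stable exactly when the returned radius is below 1.
    """
    # The state (x_t, x_{t-1}) evolves under a fixed 2x2 linear map.
    step = np.array([
        [1.0 + beta - lr * curvature, -beta],
        [1.0, 0.0],
    ])
    return float(np.max(np.abs(np.linalg.eigvals(step))))


def plateaus(lr, beta):
    """Return the (lower, higher) plateau values quoted in the abstract."""
    return 2.0 * (1.0 - beta) / lr, 2.0 * (1.0 + beta) / lr


# Illustrative hyperparameters (assumptions, not taken from the paper).
lr, beta = 0.01, 0.9
low, high = plateaus(lr, beta)

# Probe curvatures just below and just above the higher plateau.
radius_below = heavy_ball_spectral_radius(0.99 * high, lr, beta)
radius_above = heavy_ball_spectral_radius(1.01 * high, lr, beta)

print(f"lower plateau 2(1-beta)/lr  = {low:.1f}")
print(f"higher plateau 2(1+beta)/lr = {high:.1f}")
print(f"spectral radius just below the higher plateau: {radius_below:.3f}")
print(f"spectral radius just above the higher plateau: {radius_above:.3f}")
```

Under these assumptions the spectral radius crosses 1 as the curvature passes $2(1+\beta)/\eta$, which is the full-batch threshold that the large-batch regime is described as approaching.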