The dynamical stability of the iterates during training plays a key role in determining the minima obtained by optimization algorithms. For example, stable solutions of gradient descent (GD) correspond to flat minima, which have been associated with favorable generalization properties. While prior work often relies on linearization to determine stability, it remains unclear whether linearized dynamics faithfully capture the full nonlinear behavior. Recent work has shown that GD may stably oscillate near a linearly unstable minimum and still converge once the step size decays, indicating that linear analysis can be misleading. In this work, we explicitly study the effect of the nonlinear terms. Specifically, we derive an exact criterion for stable oscillations of GD near a minimum in the multivariate setting. Our condition depends on high-order derivatives of the loss, generalizing existing results. Extending the analysis to stochastic gradient descent (SGD), we show that the nonlinear dynamics can diverge in expectation if even a single batch is unstable. This implies that stability can be dictated by a single batch that oscillates unstably, rather than by an average effect across batches, as linear analysis suggests. Finally, we prove that if all batches are linearly stable, the nonlinear dynamics of SGD are stable in expectation.
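To make the GD phenomenon concrete, here is a minimal one-dimensional sketch; the constants eta, lam, and beta are illustrative choices, not values from our analysis. The loss f(x) = (lam/2) x^2 + (beta/4) x^4 with beta < 0 grows subquadratically away from its minimum at x = 0, and GD with a step size above the linear stability threshold 2/lam then oscillates stably instead of diverging:

```python
# Minimal 1D sketch (constants eta, lam, beta are illustrative, not from the
# paper): f(x) = (lam/2) x**2 + (beta/4) x**4, with a minimum at x = 0.
# beta < 0 means the loss grows subquadratically away from the minimum.
eta, lam, beta = 2.5, 1.0, -0.1   # eta * lam = 2.5 > 2: linearly unstable

x = 1.0
for t in range(20):
    x -= eta * (lam * x + beta * x ** 3)   # GD step with gradient f'(x)
    print(t, x)

# Although |1 - eta*lam| = 1.5 > 1, the iterates do not diverge: they settle
# into a stable period-2 oscillation at x = +/- sqrt((2 - eta*lam)/(eta*beta))
# = +/- sqrt(2), whose stability is decided by the higher-order terms of f.
```

Whether this period-2 orbit is attracting is governed by the third- and fourth-order derivatives of the loss, which is precisely the information that linearization discards.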
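For the SGD claim, the following sketch (again with purely illustrative constants of our own choosing, not from the paper) enumerates all equally likely batch sequences of a hypothetical two-batch problem in one dimension, so the reported averages are exact expectations. One batch oscillates unstably while the linearized mean-square analysis of the mixture predicts stability; the nonlinear expectation nevertheless blows up:

```python
import numpy as np

# Hypothetical two-batch problem in 1D (constants are illustrative, not from
# the paper): batch losses f_i(x) = (lam_i/2) x**2 + (beta_i/4) x**4, so each
# SGD step uses the batch gradient lam_i * x + beta_i * x**3.
eta = 1.0
lam1, beta1 = 2.095, 0.2   # |1 - eta*lam1| > 1: this batch oscillates unstably
lam2, beta2 = 0.293, 0.0   # strongly contracting batch

# Linearized mean-square factor E[(1 - eta*lam_B)^2] over a uniform batch B:
# below 1, so *linear* analysis predicts stability in expectation near x = 0.
ms = 0.5 * ((1 - eta * lam1) ** 2 + (1 - eta * lam2) ** 2)
print(f"linearized mean-square factor = {ms:.3f}")  # ~0.85 < 1

def step(x, lam, beta):
    """One gradient step on a single batch loss."""
    return x - eta * (lam * x + beta * x ** 3)

T, x0 = 10, 1.0
xs_nl = np.array([x0])   # nonlinear dynamics over all batch sequences
xs_li = np.array([x0])   # linearized dynamics (beta_i dropped)
for t in range(1, T + 1):
    # Branch every trajectory on both equally likely batch choices, so the
    # arrays hold x_t for all 2**t batch sequences and the mean is exact.
    xs_nl = np.concatenate([step(xs_nl, lam1, beta1), step(xs_nl, lam2, beta2)])
    xs_li = np.concatenate([step(xs_li, lam1, 0.0), step(xs_li, lam2, 0.0)])
    print(f"t={t:2d}  E[x^2]: nonlinear {np.mean(xs_nl**2):.2e}, "
          f"linearized {np.mean(xs_li**2):.2e}")
```

In this toy construction the linearized expectation contracts by the factor 0.85 at every step, while the nonlinear expectation is dominated by the rare sequences that repeatedly draw the unstable batch, consistent with the single-batch effect described above.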