The dynamical stability of optimization methods in the vicinity of minima of the loss has recently attracted significant attention. For gradient descent (GD), stable convergence is possible only to minima that are sufficiently flat with respect to the step size, and such minima have been linked with favorable properties of the trained model. However, while the stability threshold of GD is well known, no explicit expression has yet been derived for the exact threshold of stochastic GD (SGD). In this paper, we derive such a closed-form expression. Specifically, we provide an explicit condition on the step size $\eta$ that is both necessary and sufficient for the stability of SGD in the mean square sense. Our analysis sheds light on the precise role of the batch size $B$. In particular, we show that the stability threshold is a monotonically non-decreasing function of the batch size, which means that reducing the batch size can only hurt stability. Furthermore, we show that SGD's stability threshold is equivalent to that of a process that, in each iteration, takes a full-batch gradient step with probability $1-p$ and a single-sample gradient step with probability $p$, where $p \approx 1/B$. This indicates that even with moderate batch sizes, SGD's stability threshold is very close to that of GD. Finally, we prove simple necessary conditions for stability, which depend on the batch size and are easier to compute than the precise threshold. We demonstrate our theoretical findings through experiments on the MNIST dataset.
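To make the notion of a mean-square stability threshold concrete, the following is a minimal numerical sketch, not the closed-form expression derived in the paper. It assumes a toy setting that the abstract does not specify: linearized SGD dynamics $\theta_{t+1} = (I - \eta H_{\mathcal{B}})\theta_t$ around an interpolating minimum of a least-squares loss, per-sample Hessians $H_i = x_i x_i^\top$, and batches drawn uniformly without replacement. Under these assumptions the second moment $\mathbb{E}[\theta_t \theta_t^\top]$ evolves linearly, so stability can be checked through the spectral radius of $\mathbb{E}[A \otimes A]$ with $A = I - \eta H_{\mathcal{B}}$. All function names are hypothetical, and the threshold is located by bisection.

```python
import itertools
import numpy as np


def second_moment_operator(hessians, eta, batch_size):
    """Return E[A kron A], where A = I - eta * (mean Hessian of a random batch).

    Assumption: batches are drawn uniformly without replacement, so the
    expectation is an average over all index subsets of the given size.
    """
    n, d, _ = hessians.shape
    batches = list(itertools.combinations(range(n), batch_size))
    q = np.zeros((d * d, d * d))
    for batch in batches:
        a = np.eye(d) - eta * hessians[list(batch)].mean(axis=0)
        q += np.kron(a, a)
    return q / len(batches)


def is_mean_square_stable(hessians, eta, batch_size, tol=1e-9):
    """Check that the second-moment dynamics do not grow (spectral radius <= 1)."""
    q = second_moment_operator(hessians, eta, batch_size)
    # q is symmetric because every A is symmetric, so eigvalsh applies.
    rho = np.max(np.abs(np.linalg.eigvalsh(q)))
    return rho <= 1.0 + tol


def numerical_threshold(hessians, batch_size, iters=60):
    """Bisect for the largest stable step size.

    Assumption: the stable step sizes form an interval [0, eta*].
    The upper bracket 4 / lambda_max is already unstable for full-batch GD.
    """
    lam_max = np.linalg.eigvalsh(hessians.mean(axis=0))[-1]
    lo, hi = 0.0, 4.0 / lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_mean_square_stable(hessians, mid, batch_size):
            lo = mid
        else:
            hi = mid
    return lo


# Toy data (hypothetical): n samples in d dimensions, H_i = x_i x_i^T.
rng = np.random.default_rng(0)
n, d = 6, 2
xs = rng.normal(size=(n, d))
hessians = np.einsum("ni,nj->nij", xs, xs)

# Classical GD threshold: 2 / (largest eigenvalue of the full Hessian).
gd_threshold = 2.0 / np.linalg.eigvalsh(hessians.mean(axis=0))[-1]

thresholds = {b: numerical_threshold(hessians, b) for b in range(1, n + 1)}
for b, eta_star in thresholds.items():
    print(f"B={b}: eta* ~ {eta_star:.6f}  (GD threshold {gd_threshold:.6f})")
```

In this toy setting, the full-batch case $B = n$ should recover the GD threshold $2/\lambda_{\max}(H)$, and the numerical thresholds should not decrease as $B$ grows, in line with the monotonicity result stated above. Enumerating all batches is only feasible for very small $n$; it is used here purely to keep the expectation exact.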