This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typically considered in theoretical work on SGD: for example, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance along the flat ones. As a consequence, SGD without replacement traverses flat areas and may escape saddles significantly faster than SGD with replacement. On several vision tasks, the novel regularizer penalizes a weighted trace of the Fisher matrix, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike gradient descent (GD), does not train at the edge of stability.
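To make the distinction between the two sampling schemes concrete, the following is a minimal sketch, not the setup analyzed in this work. It assumes NumPy and a toy least-squares loss, and the names `batches_without_replacement`, `batches_with_replacement`, and `sgd_epoch` are hypothetical helpers introduced only for illustration. Without replacement, one epoch partitions a single random permutation of the data, so every example is visited exactly once; with replacement, each batch is drawn independently of the previous ones (here assumed to be uniform over index subsets of a fixed size).

```python
import numpy as np


def batches_without_replacement(n, batch_size, rng):
    """Yield one epoch of index batches cut from a single random permutation."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]


def batches_with_replacement(n, batch_size, rng):
    """Yield the same number of batches, each sampled independently of the others.

    Assumption: indices within a batch are distinct, but different batches
    may overlap, so the batches are i.i.d.
    """
    num_batches = -(-n // batch_size)  # ceiling division
    for _ in range(num_batches):
        yield rng.choice(n, size=batch_size, replace=False)


def sgd_epoch(w, X, y, lr, batch_iter):
    """Run plain SGD steps on the toy loss 0.5 * mean((X w - y)^2)."""
    for idx in batch_iter:
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / len(idx)
        w = w - lr * grad
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    w0 = np.zeros(3)
    w_wor = sgd_epoch(w0, X, y, 0.1, batches_without_replacement(64, 8, rng))
    w_wr = sgd_epoch(w0, X, y, 0.1, batches_with_replacement(64, 8, rng))
    print("without replacement:", w_wor)
    print("with replacement:   ", w_wr)
```

The only difference between the two runs is the index generator passed to `sgd_epoch`. In the without-replacement case the batches within an epoch are statistically dependent, because they must jointly cover the dataset; this dependence is the feature that separates the two variants compared above.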