In this paper, we provide a theoretical study of the noise geometry of minibatch stochastic gradient descent (SGD), the phenomenon whereby SGD noise aligns favorably with the geometry of the local loss landscape. We propose two metrics, derived from analyzing how noise influences the loss dynamics and the subspace projection dynamics, to quantify the strength of this alignment. We show that, for (over-parameterized) linear models and two-layer nonlinear networks, the alignment measured by these metrics is provably guaranteed under conditions independent of the degree of over-parameterization. To showcase the utility of our noise geometry characterizations, we present a refined analysis of the mechanism by which SGD escapes from sharp minima. We reveal that, unlike gradient descent (GD), which escapes along the sharpest directions, SGD tends to escape along flatter directions, and that cyclical learning rates can exploit this characteristic of SGD to navigate more effectively towards flatter regions. Finally, extensive experiments support our theoretical findings.
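As a minimal illustration of the GD baseline mentioned above (GD escaping along the sharpest directions), the sketch below runs plain GD on a two-dimensional quadratic. The eigenvalues, learning rate, and helper name `gd_on_quadratic` are hypothetical choices made for illustration and are not taken from the paper; the sketch does not reproduce the SGD analysis or the proposed alignment metrics.

```python
import numpy as np

def gd_on_quadratic(hessian_eigs, lr, theta0, steps):
    """Run GD on L(theta) = 0.5 * sum_i eig_i * theta_i**2 and return all iterates."""
    eigs = np.asarray(hessian_eigs, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        # The gradient of the quadratic is eigs * theta (diagonal Hessian).
        theta = theta - lr * eigs * theta
        path.append(theta.copy())
    return np.array(path)

# Hypothetical toy values: one sharp direction (eigenvalue 4.0)
# and one flat direction (eigenvalue 1.0).
eigs = [4.0, 1.0]
# GD is unstable along a direction when lr > 2 / eigenvalue.
# Here 2/4 < lr < 2/1, so only the sharp direction is unstable.
lr = 0.6
path = gd_on_quadratic(eigs, lr, theta0=[1e-3, 1e-3], steps=20)

# Ratio of final to initial displacement along each coordinate.
sharp_growth = abs(path[-1, 0]) / abs(path[0, 0])
flat_growth = abs(path[-1, 1]) / abs(path[0, 1])
print(f"sharp direction: {sharp_growth:.3g}x, flat direction: {flat_growth:.3g}x")
```

Each coordinate is multiplied by `1 - lr * eig` per step, so the displacement along the sharp direction grows while the displacement along the flat direction shrinks: in this toy setting GD leaves the minimum along the sharpest direction.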