Empirical studies have demonstrated that the noise in stochastic gradient descent (SGD) aligns favorably with the local geometry of the loss landscape. However, theoretical and quantitative explanations for this phenomenon remain sparse. In this paper, we present a comprehensive theoretical investigation of this {\em noise geometry} for over-parameterized linear models (OLMs) and two-layer neural networks. We analyze both average and directional alignments, paying special attention to how factors such as sample size and input data degeneracy affect the alignment strength. As a specific application, we leverage our noise geometry characterizations to study how SGD escapes from sharp minima, revealing that the escape direction has significant components along flat directions. This is in stark contrast to gradient descent (GD), which escapes only along the sharpest directions. To substantiate our theoretical findings, we provide experiments on both synthetic and real-world data.
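To make the notion of average alignment concrete, the following is a minimal sketch for an over-parameterized linear model with squared loss. The ratio computed here, $\mathrm{Tr}(\Sigma H) / (2 L \|H\|_F^2)$, is one plausible alignment measure chosen for illustration; it is an assumption and not necessarily the definition used in the paper. The function name `noise_alignment`, the problem sizes, and the synthetic Gaussian data are likewise hypothetical.

```python
import numpy as np


def noise_alignment(X, y, theta):
    """Illustrative alignment between SGD noise and the Hessian.

    Assumes the squared loss L(theta) = 1/(2n) * sum_i (x_i^T theta - y_i)^2.
    Returns (alignment, loss). The alignment ratio is an assumed measure,
    not taken from the source text.
    """
    n = X.shape[0]
    residuals = X @ theta - y                      # shape (n,)
    per_sample_grads = residuals[:, None] * X      # shape (n, d)
    full_grad = per_sample_grads.mean(axis=0)      # shape (d,)

    # Covariance of the single-sample stochastic gradient.
    noise_cov = per_sample_grads.T @ per_sample_grads / n
    noise_cov -= np.outer(full_grad, full_grad)

    # Hessian of the squared loss (independent of theta for a linear model).
    hessian = X.T @ X / n

    loss = 0.5 * np.mean(residuals ** 2)
    alignment = np.trace(noise_cov @ hessian) / (
        2.0 * loss * np.linalg.norm(hessian, "fro") ** 2
    )
    return alignment, loss


# Synthetic over-parameterized setup: fewer samples (n) than parameters (d).
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star                                  # noiseless labels
theta = theta_star + 0.1 * rng.standard_normal(d)   # a point near the minimum

alignment, loss = noise_alignment(X, y, theta)
print(f"loss = {loss:.4f}, alignment = {alignment:.4f}")
```

Varying `n` and `d` in this sketch gives a rough way to see how sample size affects the ratio; the function divides by the loss, so it should not be called exactly at a global minimum where the loss is zero.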