Empirical studies have shown that the noise in stochastic gradient descent (SGD) aligns favorably with the local geometry of the loss landscape. However, theoretical and quantitative explanations for this phenomenon remain sparse. In this paper, we offer a comprehensive theoretical investigation of this {\em noise geometry} for over-parameterized linear models (OLMs) and two-layer neural networks. We analyze both average and directional alignment, paying special attention to how factors such as sample size and input data degeneracy affect the alignment strength. As a specific application, we leverage our noise geometry characterizations to study how SGD escapes from sharp minima, revealing that the escape direction has significant components along flat directions. This is in stark contrast to gradient descent (GD), which escapes only along the sharpest directions. To substantiate our theoretical findings, we provide both synthetic and real-world experiments.
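To make the notion of alignment concrete, the following is a minimal sketch for an over-parameterized linear model with squared loss. It is an illustration under stated assumptions, not the alignment measure defined in the paper: the function name `noise_alignment`, the synthetic Gaussian data, and the choice of a Frobenius cosine similarity between the SGD noise covariance and the Hessian are all hypothetical choices made here.

```python
import numpy as np

def noise_alignment(X, y, w):
    """Return the batch-size-1 SGD noise covariance, the Hessian, and their
    Frobenius cosine similarity for the loss (1/2n) * sum_i (x_i^T w - y_i)^2.

    The cosine similarity is an illustrative alignment score (an assumption
    of this sketch), not necessarily the measure used in the paper.
    """
    n = X.shape[0]
    residuals = X @ w - y                        # shape (n,)
    per_sample_grads = residuals[:, None] * X    # row i is (x_i^T w - y_i) * x_i
    full_grad = per_sample_grads.mean(axis=0)    # gradient of the full loss
    # Covariance of the per-sample gradients around the full gradient.
    sigma = per_sample_grads.T @ per_sample_grads / n - np.outer(full_grad, full_grad)
    hessian = X.T @ X / n                        # Hessian of the squared loss
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    cosine = np.trace(sigma @ hessian) / (np.linalg.norm(sigma) * np.linalg.norm(hessian))
    return sigma, hessian, cosine

# Synthetic over-parameterized setup (assumed for illustration): d > n.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                                   # noiseless labels
w = w_star + 0.1 * rng.standard_normal(d)        # a point near a global minimum

sigma, hessian, cosine = noise_alignment(X, y, w)
print(f"Frobenius cosine alignment: {cosine:.3f}")
```

Because both matrices are built from the same inputs $x_i$, any direction orthogonal to all $x_i$ carries neither curvature nor noise in this model, which is one simple way to see why the noise cannot be isotropic when $d > n$.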