A large body of theoretical and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, the existing evidence is conflicting as to when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically tractable model that exhibits both flattening and sharpening during training. In this model, SGD has no \textit{a priori} preference for flatness, only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, the data distribution uniquely determines the sharpness at convergence: a flat minimum is preferred if and only if the label noise is isotropic across all output dimensions. When the label noise is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the label-noise spectrum. We reproduce this key insight in controlled settings with different model architectures, including MLPs, RNNs, and transformers.
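To make the last claim concrete, the following is a minimal, hypothetical sketch of one such controlled experiment, assuming PyTorch; the network, the noise scales, and the helper names (\texttt{make\_data}, \texttt{hessian\_trace}, \texttt{run}) are illustrative choices, not the paper's exact setup. It trains a small linear network with SGD under isotropic versus anisotropic label noise (matched in total variance) and compares a standard sharpness proxy, a Hutchinson estimate of the Hessian trace at the found solution.

\begin{verbatim}
# Hypothetical sketch: compare sharpness at convergence under isotropic
# vs. anisotropic label noise. Not the paper's exact solvable model.
import torch

torch.manual_seed(0)

def make_data(n=512, d_in=4, d_out=2, noise_std=(0.5, 0.5)):
    """Targets y = x @ W_true + eps, with per-output noise scales."""
    x = torch.randn(n, d_in)
    w_true = torch.randn(d_in, d_out)
    eps = torch.randn(n, d_out) * torch.tensor(noise_std)
    return x, x @ w_true + eps

def hessian_trace(loss_fn, params, n_probes=20):
    """Hutchinson estimator of tr(H) at the current parameters."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    for _ in range(n_probes):
        # Rademacher (+/-1) probe vectors, one per parameter tensor.
        vs = [torch.randint(0, 2, g.shape).float() * 2 - 1 for g in grads]
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        trace += sum((hv * v).sum() for hv, v in zip(hvs, vs))
    return (trace / n_probes).item()

def run(noise_std, steps=5000, lr=0.05, batch=32):
    x, y = make_data(noise_std=noise_std)
    # A two-layer linear network: a simple stand-in with a degenerate
    # minimum manifold along which sharpness can vary.
    model = torch.nn.Sequential(torch.nn.Linear(4, 16),
                                torch.nn.Linear(16, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, x.shape[0], (batch,))
        loss = ((model(x[idx]) - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    params = list(model.parameters())
    full_loss = lambda: ((model(x) - y) ** 2).mean()
    return hessian_trace(full_loss, params)

# Total label-noise variance is 0.5 in both runs; only its spread
# across the two output dimensions differs.
print("isotropic   tr(H):", run(noise_std=(0.5, 0.5)))
print("anisotropic tr(H):", run(noise_std=(0.707, 0.0)))
\end{verbatim}

Whether the anisotropic run converges to a measurably sharper point depends on the architecture, learning rate, and batch size; the sketch is a scaffold for the controlled comparisons described above, not a definitive reproduction of the paper's experiments.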