In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, that is, subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the training loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.
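The competition between curvature and gradient noise described above can be illustrated with a one-dimensional toy model. This is a minimal sketch under assumptions that are not taken from the paper: a single parameter `theta`, an expected loss `-mu * theta**2 / 2` (so `theta = 0` is a local maximum), and a stochastic gradient `(-mu + sigma * xi) * theta` with `xi` standard normal, which vanishes at `theta = 0` and therefore makes `{0}` an invariant set. The names `log_growth_rate`, `run_sgd`, `mu`, `sigma`, and `lr` are hypothetical. Under these assumptions each SGD step multiplies `theta` by `1 + lr*mu - lr*sigma*xi`, so the sign of the average log of that factor indicates whether the iterates are pulled towards or pushed away from the invariant set.

```python
import numpy as np


def log_growth_rate(mu, sigma, lr, steps=200_000, seed=0):
    """Monte Carlo estimate of E[log|1 + lr*mu - lr*sigma*xi|], xi ~ N(0, 1).

    Toy assumption: a negative value means |theta| shrinks on average in
    log scale, i.e. the invariant set {theta = 0} attracts the iterates.
    """
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(steps)
    factors = np.abs(1.0 + lr * mu - lr * sigma * xi)
    return float(np.mean(np.log(factors)))


def run_sgd(theta0, mu, sigma, lr, steps=2_000, seed=0):
    """Run SGD on the toy model and return the final parameter value."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(steps):
        # Noisy gradient of the assumed loss -mu * theta**2 / 2.
        # It is exactly zero at theta = 0, so {0} is invariant.
        grad = (-mu + sigma * rng.standard_normal()) * theta
        theta -= lr * grad
    return theta


if __name__ == "__main__":
    for sigma in (1.0, 10.0):
        rate = log_growth_rate(mu=1.0, sigma=sigma, lr=0.1)
        final = run_sgd(theta0=1.0, mu=1.0, sigma=sigma, lr=0.1)
        print(f"sigma={sigma}: log growth rate={rate:.3f}, final |theta|={abs(final):.3e}")
```

In this toy setting, with `mu = 1.0` and `lr = 0.1`, the low-noise run escapes the local maximum while the high-noise run collapses onto it, which mirrors the qualitative claim that stronger gradient noise can make an invariant set attractive even where the loss curvature alone would repel. It says nothing quantitative about the sufficient condition derived in the paper.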