We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) through this stabilization, induce a hidden stochastic dynamics orthogonal to the bouncing directions that implicitly biases the iterates toward sparse predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD training dynamics as influenced by the step size schedule. These observations therefore unveil how, through the step size schedule, gradient and noise together drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training neural networks. The code for our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
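
As a rough illustration of the "jumping from one side of a valley to the other" behavior, the following toy sketch runs plain (deterministic) gradient descent on a one-dimensional quadratic. It is not the setup studied in the paper, where stabilization arises from SGD noise in neural networks; the function names, the curvature value, and the two step sizes are assumptions chosen only to make the bouncing visible. With a small step size the iterate stays on one side of the minimum, while with the step size set to 2 divided by the curvature it flips sides at every step and the loss stays constant.

```python
def gd_on_quadratic(x0, curvature, step_size, num_steps):
    """Run plain gradient descent on f(x) = 0.5 * curvature * x**2 (toy model)."""
    xs = [x0]
    for _ in range(num_steps):
        grad = curvature * xs[-1]             # f'(x) = curvature * x
        xs.append(xs[-1] - step_size * grad)  # standard gradient step
    return xs


def loss(x, curvature):
    """Value of the toy quadratic loss at x."""
    return 0.5 * curvature * x ** 2


curvature = 1.0  # hypothetical valley sharpness

# Small step: each update multiplies x by (1 - 0.5) = 0.5, so x keeps its sign.
small = gd_on_quadratic(1.0, curvature, step_size=0.5, num_steps=6)

# Large step: each update multiplies x by (1 - 2.0) = -1, so x flips sides
# of the valley at every step while the loss value does not change.
large = gd_on_quadratic(1.0, curvature, step_size=2.0, num_steps=6)

print("small step iterates:", small)
print("large step iterates:", large)
print("large step losses:  ", [loss(x, curvature) for x in large])
```

In this deterministic toy the constant loss is an artifact of the exactly tuned step size; it is meant only to make the side-to-side bouncing concrete, not to reproduce the noise-driven stabilization or the sparsity bias described above.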