Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates

In this work, we theoretically investigate the generalization properties of neural networks (NNs) trained by the stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, we find that the oscillation of the NN weights caused by large learning rate SGD training turns out to be beneficial to the generalization of the NN, potentially improving over the same NN trained by SGD with small learning rates, which converges more smoothly. In view of this finding, we call this phenomenon "benign oscillation". Our theory for demystifying this phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features, which have a small $\ell_2$-norm and appear in every data point; (ii) strong features, which have a larger $\ell_2$-norm but appear in only a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of the strong features. In contrast, NNs trained by SGD with a small learning rate can learn only the strong features and make little progress in learning the weak features. Consequently, on new test data that consist of only weak features, the NN trained by oscillating SGD with a large learning rate can still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results corroborate our finding on "benign oscillation".
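To make the feature-noise data model described above concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the construction used in the paper. The three-patch layout, the choice of coordinate axes as feature directions, the Gaussian noise, and all names and default values (`make_feature_noise_data`, `weak_norm`, `strong_norm`, `strong_frac`, `noise_std`) are assumptions introduced here for illustration only.

```python
import numpy as np

def make_feature_noise_data(n, d, weak_norm=0.5, strong_norm=5.0,
                            strong_frac=0.3, noise_std=0.1, seed=0):
    """Sample n points, each made of three d-dimensional patches.

    Hypothetical layout (an assumption, not the paper's exact model):
      patch 0: weak feature, present in every sample, small l2-norm
      patch 1: strong feature, present with probability strong_frac, large l2-norm
      patch 2: Gaussian noise
    """
    rng = np.random.default_rng(seed)

    # Feature directions: two orthogonal coordinate axes, scaled to the chosen norms.
    u = np.zeros(d)
    u[0] = weak_norm      # weak feature vector
    v = np.zeros(d)
    v[1] = strong_norm    # strong feature vector

    # Binary labels in {-1, +1}; each feature patch is the label times the feature.
    y = rng.choice([-1, 1], size=n)

    # The strong feature appears only in a random fraction of the samples.
    has_strong = rng.random(n) < strong_frac

    weak_patch = y[:, None] * u
    strong_patch = (has_strong * y)[:, None] * v
    noise_patch = noise_std * rng.standard_normal((n, d))

    X = np.stack([weak_patch, strong_patch, noise_patch], axis=1)  # shape (n, 3, d)
    return X, y, has_strong
```

Under these assumptions, calling the function with `strong_frac=0.0` yields samples that contain only the weak feature and noise, which mirrors the test setting described in the abstract.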
