Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses have not found that momentum offers any provable acceleration. The theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly over both short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
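The claim that SGD with and without momentum behave similarly at small learning rates can be illustrated on a toy problem. The sketch below is not taken from the paper's experiments: it assumes a one-dimensional noisy quadratic, the heavy-ball form of momentum, and a rescaled learning rate of `lr / (1 - beta)` for plain SGD so that both methods take steps of comparable effective size. The function name `run_sgd_pair` and all parameter values are hypothetical choices made for illustration.

```python
import random


def run_sgd_pair(lr=1e-4, beta=0.9, curvature=1.0, noise_std=1.0,
                 steps=5000, x0=1.0, seed=0):
    """Run heavy-ball SGD with momentum and plain SGD side by side.

    Toy objective (an assumption for illustration): f(x) = 0.5 * curvature * x**2,
    with additive Gaussian noise on every gradient. Both runs see the same noise
    sequence, so any gap between them comes from the update rule alone.
    """
    rng = random.Random(seed)
    # Assumed rescaling: plain SGD uses lr / (1 - beta) so that its steps match
    # the accumulated step size of the momentum run.
    lr_eff = lr / (1.0 - beta)

    x_mom, velocity = x0, 0.0   # SGD with momentum (heavy ball)
    x_sgd = x0                  # plain SGD
    max_gap = 0.0

    for _ in range(steps):
        noise = noise_std * rng.gauss(0.0, 1.0)

        # Noisy gradients evaluated at each method's own iterate.
        grad_mom = curvature * x_mom + noise
        grad_sgd = curvature * x_sgd + noise

        # Heavy-ball update: accumulate gradients, then step.
        velocity = beta * velocity + grad_mom
        x_mom -= lr * velocity

        # Plain SGD update with the rescaled learning rate.
        x_sgd -= lr_eff * grad_sgd

        max_gap = max(max_gap, abs(x_mom - x_sgd))

    return x_mom, x_sgd, max_gap


if __name__ == "__main__":
    x_mom, x_sgd, max_gap = run_sgd_pair()
    print(f"final iterate with momentum:    {x_mom:.4f}")
    print(f"final iterate without momentum: {x_sgd:.4f}")
    print(f"largest gap along the run:      {max_gap:.4f}")
```

With these settings the two trajectories should stay close relative to the starting distance from the minimum. Raising `lr` widens the gap, which is consistent with the abstract restricting its claim to the small learning rate regime. A one-dimensional quadratic says nothing about generalization or about the ImageNet and language model experiments.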