When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. We show through a novel set of experiments that stochastic noise is not sufficient to explain good non-convex training, and that the large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects in the noiseless case, i.e., for full-batch gradient descent (GD). We formally prove that, on certain non-convex function classes, GD with a large step size follows a different trajectory from GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Our settings provide a framework for future analysis that allows algorithms to be compared based on behaviors that cannot be observed in the traditional settings.
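As an informal illustration of the last claim, the sketch below runs full-batch GD with two step sizes on a one-dimensional toy objective. The objective (the pointwise minimum of a sharp and a wide quadratic), the starting point, the step sizes, and all function names are assumptions chosen for this example; they are not the function classes or experiments analyzed in the paper.

```python
# Toy objective (an assumption for illustration): the pointwise minimum of
#   a sharp quadratic with a local minimum at x = 0 (value 1), and
#   a wide quadratic with the global minimum at x = 3 (value 0).

def f(x):
    sharp = 10.0 * x**2 + 1.0
    wide = 0.5 * (x - 3.0) ** 2
    return min(sharp, wide)


def grad_f(x):
    # Gradient of whichever quadratic is active at x.
    sharp = 10.0 * x**2 + 1.0
    wide = 0.5 * (x - 3.0) ** 2
    if sharp < wide:
        return 20.0 * x
    return x - 3.0


def gradient_descent(x0, step_size, num_steps=200):
    # Plain deterministic GD: no stochastic noise is involved.
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x


# Both runs start inside the sharp basin around x = 0.
x_small = gradient_descent(0.2, step_size=0.01)
x_large = gradient_descent(0.2, step_size=0.25)

print(f"small step: x = {x_small:.4f}, f(x) = {f(x_small):.4f}")
print(f"large step: x = {x_large:.4f}, f(x) = {f(x_large):.4f}")
```

In this toy setting, the small step size contracts toward the sharp local minimum at x = 0. The large step size is unstable on the sharp quadratic (its curvature of 20 exceeds 2 divided by the step size), so the iterates are pushed out of that basin and then contract toward the wide global minimum at x = 3, where the curvature is only 1.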