Stochastic gradient descent is a workhorse for training deep neural networks because of its excellent generalization performance. Several studies have attributed this success to the method's implicit bias toward flat minima and have developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that averaged stochastic gradient descent with a large step size brings out this implicit bias more effectively and converges more stably to a flat minimum than vanilla stochastic gradient descent. In this work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff induced by stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, whereas a small step size stabilizes convergence but weakens the bias. Specifically, we show that, under certain conditions, averaged stochastic gradient descent gets closer to a solution of a sharpness-penalized objective than vanilla stochastic gradient descent with the same step size. In experiments, we verify our theory and show that this learning scheme significantly improves performance.
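The stabilizing effect of averaging at a fixed step size can be illustrated with a toy sketch. The example below is not the setting analyzed in this work: it assumes a one-dimensional quadratic f(x) = 0.5 * x^2 with additive Gaussian gradient noise, and it averages all iterates uniformly, which is only one possible averaging scheme. The function names, step size, noise level, and trial count are all hypothetical choices made for illustration. It says nothing about flatness or sharpness; it only shows that, for the same large step size, the averaged iterate tends to land closer to the minimizer than the last iterate.

```python
import random


def run_sgd(step_size=0.5, num_steps=2000, noise_std=1.0, seed=0):
    """Run noisy SGD on f(x) = 0.5 * x**2 (toy assumption).

    Returns a tuple (last_iterate, averaged_iterate), where the average
    is the uniform mean of all iterates produced after each update.
    """
    rng = random.Random(seed)
    x = 5.0  # arbitrary starting point away from the minimizer x* = 0
    running_sum = 0.0
    for _ in range(num_steps):
        # Exact gradient of 0.5 * x**2 is x; add Gaussian noise to mimic
        # stochastic gradient noise.
        noisy_grad = x + rng.gauss(0.0, noise_std)
        x -= step_size * noisy_grad
        running_sum += x
    return x, running_sum / num_steps


def mean_squared_distance(num_trials=50, **sgd_kwargs):
    """Average squared distance to the minimizer x* = 0 over several seeds.

    Returns (mse_of_last_iterate, mse_of_averaged_iterate).
    """
    last_total = 0.0
    avg_total = 0.0
    for seed in range(num_trials):
        last, averaged = run_sgd(seed=seed, **sgd_kwargs)
        last_total += last ** 2
        avg_total += averaged ** 2
    return last_total / num_trials, avg_total / num_trials


if __name__ == "__main__":
    last_mse, avg_mse = mean_squared_distance()
    print(f"last iterate     mean squared distance: {last_mse:.5f}")
    print(f"averaged iterate mean squared distance: {avg_mse:.5f}")
```

In this toy setting the last iterate keeps fluctuating around the minimizer with a variance set by the step size, while the averaged iterate smooths those fluctuations out. Shrinking `step_size` would also reduce the fluctuation of the last iterate, but at the cost of the weaker noise-induced bias that the abstract describes.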