While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used for training machine learning models, there is little theoretical understanding of the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap of SGD with standard heavy-ball momentum (SGDM), run for multiple epochs, becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, namely SGD with early momentum (SGDEM), under a broad range of step-sizes, and show that it can train machine learning models for multiple epochs with a generalization guarantee. Finally, for the special case of strongly convex loss functions, we find a range of momentum values for which multiple epochs of standard SGDM, as a special form of SGDEM, also generalize. Extending our results on generalization, we also develop an upper bound on the expected true risk in terms of the number of training steps, sample size, and momentum. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. In practical distributed settings, SGDEM improves on the generalization error of SGDM when training ResNet-18 on ImageNet.
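To make the update rules named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "early momentum" means heavy-ball momentum is applied only during the first `t_d` steps and set to zero afterwards, so that standard SGDM is recovered when `t_d` equals the total number of steps. The abstract does not spell out the rule, so this reading is an assumption. The function names (`sgdem`, `squared_loss_grad`), the toy data, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


def sgdem(grad, w0, data, step_size, momentum, t_d, num_steps, seed=0):
    """Heavy-ball SGD with momentum switched off after t_d steps.

    Assumed rule (not taken from the abstract):
        w_{t+1} = w_t + mu_t * (w_t - w_{t-1}) - step_size * grad(w_t, z_t),
    with mu_t = momentum for t < t_d and mu_t = 0 otherwise.
    """
    rng = random.Random(seed)
    w_prev = list(w0)
    w = list(w0)
    for t in range(num_steps):
        z = data[rng.randrange(len(data))]  # sample one training example
        mu = momentum if t < t_d else 0.0   # early momentum only
        g = grad(w, z)
        w_next = [
            wi + mu * (wi - wpi) - step_size * gi
            for wi, wpi, gi in zip(w, w_prev, g)
        ]
        w_prev, w = w, w_next
    return w


def squared_loss_grad(w, z):
    """Gradient of 0.5 * (<w, x> - y)^2 with respect to w."""
    x, y = z
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


# Toy noiseless regression data generated by w* = (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

# SGDEM: momentum for the first 50 steps, plain SGD for the remaining 450.
w_sgdem = sgdem(squared_loss_grad, [0.0, 0.0], data,
                step_size=0.1, momentum=0.5, t_d=50, num_steps=500)

# Standard SGDM as the special case t_d = num_steps.
w_sgdm = sgdem(squared_loss_grad, [0.0, 0.0], data,
               step_size=0.1, momentum=0.5, t_d=500, num_steps=500)

print(w_sgdem, w_sgdm)
```

Under this reading, `t_d` is the only difference between the two runs, which mirrors the abstract's statement that SGDM is a special form of SGDEM. The toy problem illustrates the update rule only and says nothing about the stability or generalization bounds themselves.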