Stochastic gradient descent with momentum (SGDM) is widely used in machine learning and statistical applications. Despite the empirical benefits of SGDM over plain SGD, the theoretical role of momentum across different learning rates remains largely open. We analyze the finite-sample convergence rate of SGDM in the strongly convex setting and show that, with a large batch size, mini-batch SGDM converges faster than mini-batch SGD to a neighborhood of the optimal value. Our theoretical analysis and numerical experiments also indicate that SGDM permits a broader range of learning rates. Furthermore, we analyze the Polyak-averaged version of the SGDM estimator, establish its asymptotic normality, and show that it is asymptotically equivalent to averaged SGD. The asymptotic distribution of averaged SGDM enables uncertainty quantification for the algorithm output and statistical inference on the model parameters.
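To make the objects in the abstract concrete, the following is a minimal sketch of mini-batch SGDM with Polyak averaging on a synthetic, strongly convex least-squares problem. It is an illustration under stated assumptions, not the paper's algorithm or experimental setup: the heavy-ball update form, the function name `sgdm_least_squares`, the true parameter `theta_star`, the noise level, and all default hyperparameters are hypothetical choices made here.

```python
import random


def sgdm_least_squares(n_steps=2000, batch_size=32, lr=0.05, beta=0.9, seed=0):
    """Mini-batch heavy-ball SGD on 2-D linear regression, with Polyak averaging.

    Assumed update (one common parametrization, not necessarily the paper's):
        v_{t+1}     = beta * v_t - lr * g_t
        theta_{t+1} = theta_t + v_{t+1}
    Setting beta=0 recovers plain mini-batch SGD.
    """
    rng = random.Random(seed)
    theta_star = [1.0, -2.0]   # hypothetical true parameter
    theta = [0.0, 0.0]         # current iterate
    velocity = [0.0, 0.0]      # momentum buffer
    avg = [0.0, 0.0]           # running Polyak average of the iterates

    for t in range(1, n_steps + 1):
        # Mini-batch gradient of the squared loss 0.5 * (x . theta - y)^2.
        grad = [0.0, 0.0]
        for _ in range(batch_size):
            x = [rng.gauss(0, 1), rng.gauss(0, 1)]
            y = sum(xi * si for xi, si in zip(x, theta_star)) + rng.gauss(0, 0.5)
            resid = sum(xi * ti for xi, ti in zip(x, theta)) - y
            for j in range(2):
                grad[j] += resid * x[j] / batch_size

        for j in range(2):
            velocity[j] = beta * velocity[j] - lr * grad[j]
            theta[j] += velocity[j]
            # Incremental mean: avg_t = avg_{t-1} + (theta_t - avg_{t-1}) / t.
            avg[j] += (theta[j] - avg[j]) / t

    return theta, avg, theta_star


if __name__ == "__main__":
    last, averaged, target = sgdm_least_squares()
    print("last iterate:    ", last)
    print("Polyak average:  ", averaged)
    print("target parameter:", target)
```

Running the function with `beta=0.0` gives a plain mini-batch SGD baseline under the same seed, so the two sets of iterates and averages can be compared directly. The average here includes every iterate from the first step; a burn-in period could be discarded instead.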