Stochastic gradient descent with momentum (SGDM) is widely used in machine learning and statistical applications. Despite the observed empirical benefits of SGDM over traditional SGD, the theoretical understanding of the role of momentum for different learning rates in the optimization process remains largely open. We analyze the finite-sample convergence rate of SGDM in the strongly convex setting and show that, with a large batch size, mini-batch SGDM converges faster than mini-batch SGD to a neighborhood of the optimal value. Furthermore, we analyze the Polyak-averaged version of the SGDM estimator, establish its asymptotic normality, and show that it is asymptotically equivalent to averaged SGD.
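To make the objects in the abstract concrete, the following is a minimal sketch of mini-batch SGDM with Polyak (iterate) averaging on a toy strongly convex quadratic. It is illustrative only: the heavy-ball form of the momentum update, the function names `sgdm_polyak` and `noisy_quadratic_grad`, the toy objective, and all hyperparameter values are assumptions made here, not details taken from the paper. Setting `beta=0.0` reduces the update to plain mini-batch SGD, which gives one simple way to compare the two methods empirically.

```python
import random

# Hypothetical toy problem: f(x) = 0.5 * sum_i CURV[i] * (x[i] - X_STAR[i])**2,
# which is strongly convex because every curvature is positive.
X_STAR = [1.0, -2.0]
CURV = [1.0, 0.1]


def noisy_quadratic_grad(x, rng):
    """One stochastic gradient: exact gradient plus unit-variance Gaussian noise."""
    return [h * (xi - si) + rng.gauss(0.0, 1.0)
            for h, xi, si in zip(CURV, x, X_STAR)]


def sgdm_polyak(grad_fn, x0, lr, beta, batch_size, n_steps, rng):
    """Mini-batch heavy-ball SGDM (assumed form); returns (last iterate, Polyak average)."""
    x = list(x0)
    v = [0.0] * len(x)    # momentum buffer
    avg = [0.0] * len(x)  # running mean of the iterates x_1, ..., x_k
    for k in range(1, n_steps + 1):
        # Average batch_size independent stochastic gradients at the current x.
        g = [0.0] * len(x)
        for _ in range(batch_size):
            sample = grad_fn(x, rng)
            g = [gi + si / batch_size for gi, si in zip(g, sample)]
        # Heavy-ball update: v <- beta * v + g, then x <- x - lr * v.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        x = [xi - lr * vi for xi, vi in zip(x, v)]
        # Incremental update of the Polyak average.
        avg = [ai + (xi - ai) / k for ai, xi in zip(avg, x)]
    return x, avg


if __name__ == "__main__":
    rng = random.Random(0)
    last, averaged = sgdm_polyak(noisy_quadratic_grad, [0.0, 0.0], lr=0.05,
                                 beta=0.9, batch_size=32, n_steps=2000, rng=rng)
    print("last iterate:  ", last)
    print("Polyak average:", averaged)
```

In this sketch the last iterate keeps fluctuating in a neighborhood of the minimizer whose size depends on the learning rate and batch size, while the averaged iterate smooths out that noise. This is only a qualitative illustration of the two estimators discussed above, not a reproduction of the paper's results.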