We prove explicit bounds on the exponential rate of convergence of the momentum stochastic gradient descent scheme (MSGD) and of its continuous-in-time counterpart, for arbitrary fixed hyperparameters (learning rate, friction parameter), in the context of non-convex optimization. In the small step-size regime, and in the case of flat minima or large noise intensities, these bounds establish faster convergence of MSGD compared to plain stochastic gradient descent (SGD). The results are shown for objective functions satisfying a local Polyak-Łojasiewicz inequality and under assumptions on the variance of MSGD that are satisfied in overparametrized settings. Moreover, we analyze the optimal choice of the friction parameter and show that the MSGD process almost surely converges to a local minimum.
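For concreteness, a standard heavy-ball formulation of the scheme and of the local Polyak-Łojasiewicz condition is sketched below; the notation (learning rate $\eta$, friction parameter $\gamma$, stochastic gradient $g(\theta_n,\xi_n)$ with mean $\nabla f(\theta_n)$, PL constant $\mu$) is an illustrative assumption and need not coincide with the parametrization used in the paper:
\[
  m_{n+1} = (1-\eta\gamma)\, m_n - \eta\, g(\theta_n,\xi_n), \qquad
  \theta_{n+1} = \theta_n + \eta\, m_{n+1},
\]
while a local Polyak-Łojasiewicz inequality requires, in a neighborhood of a minimum with value $f^*$,
\[
  f(\theta) - f^* \;\le\; \frac{1}{2\mu}\, \lVert \nabla f(\theta) \rVert^2 .
\]
In this parametrization the continuous-in-time counterpart is typically the damped second-order dynamics $\ddot\theta_t + \gamma\,\dot\theta_t + \nabla f(\theta_t) = \text{noise}$, with $\gamma$ again playing the role of the friction parameter.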