Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. Despite its empirical popularity, its theoretical properties remain poorly understood, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large-batch settings, no rigorous theoretical analysis has been established. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with a step-decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors of the statistical minimax rate. This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.
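To make the setting concrete, the following is a minimal sketch of stochastic heavy-ball iterations with a step-decay learning rate on a diagonal quadratic objective. It is an illustration under stated assumptions, not the algorithm or constants analyzed in the paper: the function names (`step_decay_lr`, `stochastic_heavy_ball`, `quadratic_loss`), the halving factor, the number of stages, and the noise model (Gaussian noise whose per-coordinate variance is proportional to the corresponding Hessian eigenvalue, used here as a simple stand-in for anisotropic gradient noise) are all hypothetical choices made for this example.

```python
import math
import random


def step_decay_lr(t, total_steps, eta0, num_stages, decay=0.5):
    """Piecewise-constant step size, multiplied by `decay` at each stage boundary.

    The equal-length stages and the default halving are illustrative assumptions.
    """
    stage_len = max(total_steps // num_stages, 1)
    stage = min(t // stage_len, num_stages - 1)
    return eta0 * decay ** stage


def quadratic_loss(eigs, w):
    """f(w) = 0.5 * sum_i eigs[i] * w[i]^2, a diagonal quadratic minimized at 0."""
    return 0.5 * sum(lam * wi * wi for lam, wi in zip(eigs, w))


def stochastic_heavy_ball(eigs, w0, total_steps, eta0, beta, num_stages,
                          noise_scale, seed=0):
    """Run w_{t+1} = w_t - eta_t * g_t + beta * (w_t - w_{t-1}) on the quadratic."""
    rng = random.Random(seed)
    w_prev = list(w0)  # w_{-1} = w_0, so the first step has no momentum term
    w = list(w0)
    for t in range(total_steps):
        eta = step_decay_lr(t, total_steps, eta0, num_stages)
        w_next = []
        for lam, wi, wpi in zip(eigs, w, w_prev):
            # Noisy gradient. The noise std scales with sqrt(lam), so the noise
            # covariance is proportional to the Hessian (assumption for this sketch).
            grad = lam * wi + noise_scale * math.sqrt(lam) * rng.gauss(0.0, 1.0)
            w_next.append(wi - eta * grad + beta * (wi - wpi))
        w_prev, w = w, w_next
    return w


if __name__ == "__main__":
    eigs = [1.0, 0.01]  # Hessian eigenvalues; condition number 100
    w0 = [1.0, 1.0]
    w = stochastic_heavy_ball(eigs, w0, total_steps=400, eta0=1.0, beta=0.9,
                              num_stages=4, noise_scale=0.1)
    print("initial loss:", quadratic_loss(eigs, w0))
    print("final loss:  ", quadratic_loss(eigs, w))
```

Setting `beta=0` in this sketch recovers plain SGD with the same step-decay schedule, which gives a simple baseline for comparing how quickly the noise-free (bias) part of the error decays.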