This paper aims to clearly distinguish Stochastic Gradient Descent with Momentum (SGDM) from Adam in terms of their convergence rates. We demonstrate that Adam converges faster than SGDM under the condition of non-uniformly bounded smoothness. Our findings reveal that: (1) in the deterministic setting, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has a higher-order dependence on the initial function value; (2) in the stochastic setting, Adam's convergence rate upper bound matches the lower bounds for stochastic first-order optimizers with respect to both the initial function value and the final error, whereas there are instances where SGDM fails to converge with any learning rate. These results clearly separate Adam and SGDM in terms of convergence rate. Additionally, by introducing a novel stopping-time-based technique, we further prove that if we consider the minimum gradient norm over the iterations, the corresponding convergence rate can match the lower bounds across all problem hyperparameters. The technique can also help prove that Adam with a specific hyperparameter scheduler is parameter-agnostic, and may therefore be of independent interest.
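The intuition behind the separation can be illustrated with a small sketch. It is not the paper's construction or analysis: the test function f(x) = cosh(x), the textbook heavy-ball and bias-corrected Adam updates, and all hyperparameter values are assumptions chosen for illustration. The function cosh has curvature bounded by 1 + |f'(x)| rather than by a constant, which is one simple instance of non-uniformly bounded smoothness.

```python
import math


def grad(x):
    """Gradient of the assumed test function f(x) = cosh(x)."""
    return math.sinh(x)


def gdm_step(x, m, lr=0.1, beta=0.9):
    """One heavy-ball (GDM) step: the move scales with the raw gradient."""
    m = beta * m + grad(x)
    return x - lr * m, m


def adam_step(x, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam step: the move is normalized by sqrt(v_hat)."""
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction, t counts from 1
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Start far from the minimizer x = 0, where the gradient is very large.
x0 = 10.0
x_gdm, _ = gdm_step(x0, 0.0)
x_adam, _, _ = adam_step(x0, 0.0, 0.0, t=1)

gdm_move = abs(x_gdm - x0)    # grows with |sinh(x0)|
adam_move = abs(x_adam - x0)  # stays close to lr

print(f"GDM first move:  {gdm_move:.3e}")
print(f"Adam first move: {adam_move:.3e}")
```

With the same learning rate, the first GDM move is on the order of a thousand and overshoots the minimizer badly, while the first Adam move is approximately the learning rate regardless of how large the gradient is. This only hints at why a large initial function value can hurt GDM more than Adam; the rates stated in the abstract rest on the paper's own proofs, not on this toy example.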