This paper provides the first tight convergence analyses for RMSProp and Adam in non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance. We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to address the challenges arising from the dependence among the adaptive update, the unbounded gradient estimate, and the Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges and that its denominator is upper bounded by a function of the gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. We then generalize our analysis to Adam, where the additional challenge is the mismatch between the gradient and the first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. Our complexity results for both RMSProp and Adam match the complexity lower bound established in \cite{arjevani2023lower}.
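The relationship stated above, that RMSProp is Adam without first-order momentum, can be illustrated with a minimal sketch. It assumes the common textbook form of both updates without bias correction; the function names, the default hyperparameters, and the placement of `eps` outside the square root are illustrative assumptions and may differ from the exact formulation analyzed in the paper.

```python
import numpy as np

def adam_step(x, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (assumed form: no bias correction)."""
    # First-order momentum: exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second-moment estimate: coordinate-wise moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Coordinate-wise adaptive step.
    x = x - lr * m / (np.sqrt(v) + eps)
    return x, m, v

def rmsprop_step(x, grad, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSProp update: same adaptive scaling, raw gradient in the numerator."""
    v = beta2 * v + (1.0 - beta2) * grad**2
    x = x - lr * grad / (np.sqrt(v) + eps)
    return x, v

# Hypothetical iterate and stochastic gradient, for illustration only.
x0 = np.array([1.0, -2.0])
g0 = np.array([0.5, 3.0])

# With beta1 = 0 the momentum equals the current gradient,
# so the Adam step coincides with the RMSProp step.
x_adam, m_adam, v_adam = adam_step(x0, g0, np.zeros(2), np.zeros(2), beta1=0.0)
x_rms, v_rms = rmsprop_step(x0, g0, np.zeros(2))
```

With `beta1 > 0` the numerator `m` no longer equals the current gradient, which is the gradient and momentum mismatch that the Adam analysis has to handle.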