Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad improves upon that of SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
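For concreteness, the coordinate-wise structure underlying this comparison can be sketched as follows; this is only an illustrative summary based on the standard diagonal AdaGrad update, and the symbols $\eta$ (step size), $g_t$ (stochastic gradient at iterate $x_t$), and $\epsilon$ (stability constant) are notation introduced here rather than taken from the abstract:
\[
x_{t+1,i} \;=\; x_{t,i} \;-\; \frac{\eta}{\sqrt{\textstyle\sum_{s=1}^{t} g_{s,i}^2} + \epsilon}\, g_{t,i}, \qquad i = 1,\dots,d,
\]
where each coordinate $i$ is rescaled by its own accumulated gradient magnitude. The two stationarity measures being compared are
\[
\|\nabla f(x)\|_1 = \sum_{i=1}^{d} |\partial_i f(x)| \qquad \text{and} \qquad \|\nabla f(x)\|_2 = \Big(\sum_{i=1}^{d} \big(\partial_i f(x)\big)^2\Big)^{1/2},
\]
so that a bound on the $\ell_1$-norm accounts for all $d$ coordinates additively, matching the coordinate-wise analysis described above.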