Adaptive gradient methods are arguably the most successful optimization algorithms for neural network training. While it is well-known that adaptive gradient methods can achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In this paper, we aim to close this gap by analyzing the convergence rates of AdaGrad measured by the $\ell_1$-norm of the gradient. Specifically, when the objective has an $L$-Lipschitz gradient and the stochastic gradient variance is bounded by $\sigma^2$, we prove a worst-case convergence rate of $\tilde{\mathcal{O}}(\frac{\sqrt{d}L}{\sqrt{T}} + \frac{\sqrt{d} \sigma}{T^{1/4}})$, where $d$ is the dimension of the problem. We also present a lower bound of ${\Omega}(\frac{\sqrt{d}}{\sqrt{T}})$ for minimizing the gradient $\ell_1$-norm in the deterministic setting, showing the tightness of our upper bound in the noiseless case. Moreover, under more fine-grained assumptions on the smoothness structure of the objective and the gradient noise, and under favorable gradient $\ell_1/\ell_2$ geometry, we show that AdaGrad can potentially shave a factor of $\sqrt{d}$ compared to SGD. To the best of our knowledge, this is the first result for adaptive gradient methods that demonstrates a provable gain over SGD in the non-convex setting.
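For reference, a minimal sketch of the standard coordinate-wise AdaGrad update that analyses of this kind typically consider; the step size $\eta$ and stability constant $\epsilon$ are generic symbols not defined in the abstract, and the paper may study a slight variant:
\[
x_{t+1,i} \;=\; x_{t,i} \;-\; \frac{\eta}{\sqrt{\epsilon + \sum_{s=1}^{t} g_{s,i}^2}}\, g_{t,i}, \qquad i = 1, \dots, d,
\]
where $g_t$ denotes a stochastic gradient of the objective at $x_t$. The per-coordinate denominator is what allows the effective step size to adapt to the gradient geometry, which is the mechanism behind the dimension-dependence discussed above.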