Adaptive gradient methods are the workhorses of deep learning. However, their convergence guarantees for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods that includes AMSGrad, RMSProp, and AdaGrad. For smooth nonconvex functions, we prove that these methods converge in expectation to a first-order stationary point. Our convergence rate improves on existing results for adaptive gradient methods in its dependence on the dimension. In addition, we prove high-probability bounds on the convergence rates of AMSGrad, RMSProp, and AdaGrad, which had not been established before. Our analyses shed light on the mechanism behind adaptive gradient methods when optimizing nonconvex objectives.
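To make the three named methods concrete, the following is a minimal sketch of their textbook coordinate-wise updates, written as one function with a `method` switch. It is an illustration under stated assumptions, not the general algorithm class analyzed in the paper: the function names (`init_state`, `adaptive_step`, `run`), the hyperparameter defaults, the omission of bias correction, and the test objective f(x) = Σ x_i² / (1 + x_i²) are all hypothetical choices made here, and exact gradients stand in for the stochastic gradients the analysis concerns.

```python
import math


def init_state(dim):
    """Per-coordinate buffers: momentum m, second moment v, running max v_hat."""
    return {"m": [0.0] * dim, "v": [0.0] * dim, "v_hat": [0.0] * dim}


def adaptive_step(x, grad, state, method, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """Apply one update of the chosen method and return the new iterate.

    Textbook forms only (an assumption of this sketch, no bias correction):
      adagrad: v accumulates the sum of squared gradients
      rmsprop: v is an exponential moving average of squared gradients
      amsgrad: momentum m plus the running maximum of the moving average v
    """
    m, v, v_hat = state["m"], state["v"], state["v_hat"]
    new_x = []
    for i, (xi, g) in enumerate(zip(x, grad)):
        if method == "adagrad":
            v[i] += g * g
            direction, scale = g, v[i]
        elif method == "rmsprop":
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            direction, scale = g, v[i]
        elif method == "amsgrad":
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            v_hat[i] = max(v_hat[i], v[i])  # keeps the scaling non-decreasing
            direction, scale = m[i], v_hat[i]
        else:
            raise ValueError(f"unknown method: {method}")
        new_x.append(xi - lr * direction / (math.sqrt(scale) + eps))
    return new_x


def grad_f(x):
    """Gradient of the smooth nonconvex toy objective sum(x_i^2 / (1 + x_i^2))."""
    return [2.0 * xi / (1.0 + xi * xi) ** 2 for xi in x]


def grad_norm(x):
    """Euclidean norm of the gradient, the first-order stationarity measure."""
    return math.sqrt(sum(g * g for g in grad_f(x)))


def run(method, x0, steps=500, **kwargs):
    """Iterate the chosen method from x0 using exact gradients of the toy objective."""
    x, state = list(x0), init_state(len(x0))
    for _ in range(steps):
        x = adaptive_step(x, grad_f(x), state, method, **kwargs)
    return x
```

For instance, `grad_norm(run("adagrad", [1.5, -2.0]))` can be compared with `grad_norm([1.5, -2.0])` to see the gradient norm shrink on this toy problem. The quantity tracked is the gradient norm rather than the objective value because first-order stationarity is the notion of convergence the abstract refers to.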