In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimation (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of Adam in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs rely on unrealistically strong assumptions, such as globally bounded gradients, to establish convergence to stationary points. We show that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof that gradients remain bounded along the optimization trajectory of Adam, under a generalized smoothness assumption in which the local smoothness (i.e., the Hessian norm, when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
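For readers who want the update rule in concrete form, the following is a minimal sketch of the standard bias-corrected Adam iteration, run on a toy objective. It is an illustration under stated assumptions, not the exact variant or hyperparameters analyzed in the paper: the function names `adam_minimize` and `grad_quartic`, the step size, and the test objective $f(x) = \sum_i x_i^4$ are all hypothetical choices. The quartic is used because its gradient is not globally bounded, while its Hessian norm grows only like the $2/3$ power of the gradient norm, which is sub-quadratic.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run the standard bias-corrected Adam update on a list of floats.

    grad: callable mapping a list of floats to its gradient (list of floats).
    All hyperparameter defaults are illustrative assumptions.
    """
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean) estimate
    v = [0.0] * len(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for the zero initialization of m and v.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def grad_quartic(x):
    # Gradient of the toy objective f(x) = sum_i x_i^4 (hypothetical example).
    # Its Hessian is diag(12 x_i^2), so the Hessian norm is at most
    # 12 * (||grad|| / 4)^(2/3): unbounded, but sub-quadratic in the gradient norm.
    return [4.0 * xi ** 3 for xi in x]


def norm(vec):
    return math.sqrt(sum(c * c for c in vec))


x_start = [2.0, -1.5]
x_final = adam_minimize(grad_quartic, x_start)
initial_grad_norm = norm(grad_quartic(x_start))
final_grad_norm = norm(grad_quartic(x_final))
print(initial_grad_norm, final_grad_norm)
```

A single finite run like this only shows the gradient norm shrinking on one toy problem; it does not demonstrate the $\mathcal{O}(\epsilon^{-4})$ rate, which is a statement about stochastic gradients and worst-case complexity.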