In this paper, we provide a rigorous convergence proof for the Adaptive Moment Estimation (Adam) algorithm over a wide class of optimization objectives. Despite Adam's popularity and efficiency in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs rely on unrealistically strong assumptions, such as globally bounded gradients, to show convergence to stationary points. We show that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof that gradients remain bounded along the optimization trajectory under a generalized smoothness assumption, according to which the local smoothness (i.e., the Hessian norm, when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
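For readers who want the update rule in front of them, the following is a minimal sketch of the standard Adam iteration with bias correction, not the exact variant or hyperparameters analyzed in the paper, which the abstract does not specify. The function names `adam` and `quartic_grad`, the default hyperparameters, and the test objective $f(x) = \tfrac{1}{4}\sum_i x_i^4$ are illustrative assumptions. The objective is chosen because its Hessian norm is unbounded globally yet bounded by $3\|\nabla f(x)\|^{2/3}$, a sub-quadratic function of the gradient norm. The sketch uses exact gradients, whereas the complexity results above concern gradient-oracle calls more generally.

```python
import numpy as np


def adam(grad, x0, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Standard Adam with bias correction (illustrative defaults, assumed)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # first-moment (mean) estimate of the gradient
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # correct the bias toward zero
        v_hat = v / (1.0 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


def quartic_grad(x):
    """Gradient of the hypothetical test objective f(x) = sum(x**4) / 4."""
    return x ** 3


x_final = adam(quartic_grad, x0=[2.0, -1.5])
grad_norm = np.linalg.norm(quartic_grad(x_final))
print("final iterate:", x_final)
print("final gradient norm:", grad_norm)
```

Here an $\epsilon$-stationary point means an iterate whose gradient norm is at most $\epsilon$; the printed gradient norm is the quantity to compare against a chosen $\epsilon$.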