In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm for unconstrained, non-convex, smooth stochastic optimization. Despite Adam's widespread use in machine learning, theoretical understanding of it remains limited. Prior work has primarily analyzed Adam's convergence in expectation, often requiring strong assumptions such as uniformly bounded stochastic gradients or prior knowledge of problem-dependent parameters. As a result, the applicability of these findings to practical scenarios has been limited. To overcome these limitations, we provide a detailed analysis and show that Adam converges to a stationary point with high probability at a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, without requiring any bounded-gradient assumption or any prior problem-dependent knowledge to tune hyper-parameters. Additionally, we show that Adam keeps the magnitudes of its gradients within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
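To make the setting concrete, the following is a minimal sketch of the standard coordinate-wise Adam update, run on a toy smooth non-convex objective with a noisy gradient oracle. Everything specific here is an assumption for illustration and is not taken from the paper: the function names (`adam_step`, `true_grad`, `noisy_grad`, `run`), the toy objective, the Gaussian noise whose per-coordinate variance is $\sigma_0^2 + \sigma_1^2 (\nabla_i f)^2$ as one plausible reading of "affine" variance, and the step size $1/\sqrt{T}$. The sketch uses the usual Adam with both bias corrections, not the simplified variant mentioned above.

```python
import math
import random


def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One coordinate-wise Adam update; t is the 1-based step count."""
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, g, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi       # first-moment estimate
        vi = beta2 * vi + (1.0 - beta2) * gi * gi  # second-moment estimate
        m_hat = mi / (1.0 - beta1 ** t)            # bias-corrected first moment
        v_hat = vi / (1.0 - beta2 ** t)            # bias-corrected second moment
        new_x.append(xi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def true_grad(x):
    """Gradient of the toy non-convex objective f(x) = sum_i x_i^2 / (1 + x_i^2)."""
    return [2.0 * xi / (1.0 + xi * xi) ** 2 for xi in x]


def noisy_grad(x, rng, sigma0=0.1, sigma1=0.5):
    """Stochastic gradient with per-coordinate noise variance
    sigma0^2 + sigma1^2 * (true gradient coordinate)^2 (assumed noise model)."""
    out = []
    for gi in true_grad(x):
        std = math.sqrt(sigma0 ** 2 + sigma1 ** 2 * gi ** 2)
        out.append(gi + rng.gauss(0.0, std))
    return out


def run(T=2000, seed=0):
    """Run T Adam steps from a fixed start; returns the final iterate and its true gradient."""
    rng = random.Random(seed)
    x = [1.5, -0.7]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    lr = 1.0 / math.sqrt(T)  # illustrative step size, not the paper's tuning
    for t in range(1, T + 1):
        x, m, v = adam_step(x, noisy_grad(x, rng), m, v, t, lr=lr)
    return x, true_grad(x)


if __name__ == "__main__":
    x_final, g_final = run()
    print("final iterate:", x_final)
    print("true gradient at final iterate:", g_final)
```

Note that the noise in `noisy_grad` grows with the true gradient, so no uniform bound on the stochastic gradients is assumed. A single run of this toy example illustrates the setting only and says nothing about the high-probability rate itself.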