The Adaptive Moment Estimation (Adam) algorithm is highly effective for training models across a wide range of deep learning tasks. Despite this, theoretical understanding of Adam remains limited, especially for its vanilla form in non-convex smooth settings with potentially unbounded gradients and affine variance noise. In this paper, we study vanilla Adam under these challenging conditions. We introduce a comprehensive noise model that subsumes affine variance noise, bounded noise, and sub-Gaussian noise. Under this general noise model, we show that Adam finds a stationary point at an $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ rate with high probability, where $T$ denotes the total number of iterations, matching the lower bound for stochastic first-order algorithms up to logarithmic factors. More importantly, we show that Adam's step-sizes can be chosen without knowledge of any problem parameters, yielding a better adaptation property than Stochastic Gradient Descent under the same conditions. We also provide a high-probability convergence result for Adam under a generalized smoothness condition, which allows unbounded smoothness parameters and has been shown empirically to capture the smoothness of many practical objective functions more accurately.
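For concreteness, the objects named above can be written out as follows; this is a minimal sketch in standard notation (the symbols $\beta_1, \beta_2, \eta_t, \epsilon, \sigma_0, \sigma_1, L_0, L_1$ are conventional choices assumed here, not fixed by the abstract). Vanilla Adam iterates, up to the usual bias-correction factors,
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
x_{t+1} &= x_t - \frac{\eta_t}{\sqrt{v_t} + \epsilon}\, m_t,
\end{align*}
where $g_t$ is a stochastic gradient of the objective $f$ at $x_t$ and all operations on vectors are coordinate-wise. Affine variance noise asserts the existence of constants $\sigma_0, \sigma_1 \ge 0$ with
\begin{equation*}
\mathbb{E}\!\left[\|g_t - \nabla f(x_t)\|^2 \,\middle|\, x_t\right] \le \sigma_0^2 + \sigma_1^2\, \|\nabla f(x_t)\|^2,
\end{equation*}
recovering the bounded-noise setting when $\sigma_1 = 0$. The generalized smoothness condition, commonly written as $(L_0, L_1)$-smoothness, lets the local smoothness parameter grow with the gradient norm, e.g.\ $\|\nabla f(x) - \nabla f(y)\| \le (L_0 + L_1 \|\nabla f(x)\|)\,\|x - y\|$ for sufficiently close $x, y$.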