Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions on which it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and $\beta_2$ near 1 for convergence, this work investigates the "natural" auto-convergence properties of Adam. We identify a class of highly degenerate polynomials on which Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between the theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment $v_t$ and the squared gradient $g_t^2$, which exponentially amplifies the effective learning rate. Finally, we characterize Adam's hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
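The setting above can be probed with a minimal sketch (not the paper's code; hyperparameters and the test function $f(x) = x^4$ are illustrative assumptions): textbook Adam versus plain gradient descent on a degenerate polynomial whose Hessian vanishes at the minimizer $x^* = 0$. Which of the three regimes Adam lands in depends on the hyperparameter choice, which is what the phase diagram characterizes.

```python
# Sketch: Adam vs. gradient descent on the degenerate objective
# f(x) = x^4, whose curvature vanishes at the minimum x* = 0.
# Hyperparameters below are illustrative, not the paper's settings.

def grad(x):
    return 4.0 * x ** 3  # f'(x) for f(x) = x^4


def adam(x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        # As g_t^2 shrinks faster than v_t decays, the effective
        # step lr / (sqrt(v_hat) + eps) grows: the decoupling effect.
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x


def gd(x0, lr=0.1, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step size lr * 4x^3 collapses near x = 0
    return x


x_adam = adam(1.0)
x_gd = gd(1.0)
print(abs(x_adam), abs(x_gd))  # distance of each final iterate from x* = 0
```

Varying `lr`, `beta1`, and `beta2` in this sketch is one way to reproduce the qualitative regimes: small enough steps converge smoothly, while other settings produce the spike and SignGD-like oscillation behaviors described above.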