Although Adam often converges faster than SGD in practice, much of the existing theory yields guarantees for Adam that are essentially comparable to those for SGD, leaving this empirical performance gap insufficiently explained. In this paper, we identify the second-moment normalization in Adam as a key mechanism and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded-variance model (a second-moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
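For concreteness, the standard Adam recursion (bias correction omitted, operations elementwise) makes the second-moment normalization referred to above explicit: the step is divided by $\sqrt{v_t}$, where $v_t$ tracks an exponential average of squared stochastic gradients $g_t$:
\begin{align*}
  m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, && \text{(first-moment estimate)} \\
  v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, && \text{(second-moment estimate)} \\
  x_{t+1} &= x_t - \eta\, \frac{m_t}{\sqrt{v_t} + \varepsilon}. && \text{(normalized update)}
\end{align*}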