Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis remains far from satisfactory. Existing convergence analyses of Adam rely on the bounded smoothness assumption, referred to as the \emph{$L$-smooth condition}. Unfortunately, this assumption does not hold for many deep learning tasks. Moreover, we believe that this assumption obscures the true benefit of Adam: the algorithm can adapt its update magnitude to the local smoothness, a feature that becomes irrelevant once globally bounded smoothness is assumed. This paper studies the convergence of randomly reshuffled Adam (RR Adam) with a diminishing learning rate, the version of Adam most commonly adopted in deep learning practice. We present the first convergence analysis of RR Adam without the bounded smoothness assumption. We demonstrate that RR Adam retains its convergence properties when the smoothness is only linearly bounded by the gradient norm, referred to as the \emph{$(L_0, L_1)$-smooth condition}. We further compare Adam with SGD when both methods use a diminishing learning rate. We refine the existing lower bound for SGD and show that SGD can be slower than Adam. To our knowledge, this is the first time that Adam and SGD are rigorously compared in the same setting and the advantage of Adam is revealed.
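For reference, one standard way to state this condition in the literature (the exact variant assumed in the analysis may differ, e.g., it may be local or coordinate-wise) is that the Hessian norm is bounded linearly by the gradient norm:
\[
  \big\|\nabla^2 f(x)\big\| \;\le\; L_0 + L_1 \big\|\nabla f(x)\big\| \qquad \text{for all } x,
\]
which recovers the usual $L$-smooth condition when $L_1 = 0$.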