Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning. However, their training dynamics lack a comprehensive theoretical understanding, offering limited insight into why common practices, such as specific hyperparameter choices and normalization layers, contribute to successful generalization. This work presents a continuous-time formulation of Adam and AdamW, facilitating a tractable analysis of training dynamics that can shed light on such practical questions. We theoretically derive a stable region for Adam's hyperparameters $(\beta, \gamma)$ that ensures bounded updates, and we empirically verify these predictions by observing unstable exponential growth of parameter updates outside this region. Furthermore, we theoretically justify the success of normalization layers by uncovering an implicit meta-adaptive effect of scale-invariant architectural components. This insight leads to an explicit optimizer, $2$-Adam, which we generalize to $k$-Adam, an optimizer that applies an adaptive normalization procedure $k$ times, encompassing Adam (corresponding to $k=1$) and Adam with a normalization layer (corresponding to $k=2$). Overall, our continuous-time formulation of Adam facilitates a principled analysis, offering a deeper understanding of optimal hyperparameter choices and architectural decisions in modern deep learning.
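To make the $k$-Adam construction concrete, below is a minimal NumPy sketch of one natural reading of the abstract: each of the $k$ stages keeps its own exponential moving averages and re-normalizes the output of the previous stage, so that $k=1$ recovers plain Adam. The function name `k_adam_step`, the per-stage state layout, and the omission of bias correction are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def k_adam_step(param, grad, state, lr=1e-3, beta=0.9, gamma=0.999, eps=1e-8, k=2):
    """One k-Adam step: apply Adam's adaptive normalization k times in sequence.

    `state` holds per-stage moment estimates state["m"][i] and state["v"][i].
    With k=1 this reduces to plain Adam (bias correction omitted for brevity).
    Assumed composition; the paper's definitive algorithm may differ.
    """
    u = grad
    for i in range(k):
        # Each stage keeps its own first/second-moment estimates of the signal
        # produced by the previous stage (the raw gradient when i == 0).
        state["m"][i] = beta * state["m"][i] + (1 - beta) * u
        state["v"][i] = gamma * state["v"][i] + (1 - gamma) * u**2
        u = state["m"][i] / (np.sqrt(state["v"][i]) + eps)
    return param - lr * u

# Usage: minimize f(w) = ||w||^2 with k=2, i.e. Adam plus one extra
# adaptive-normalization stage standing in for a normalization layer.
w = np.array([1.0, -2.0])
state = {"m": [np.zeros_like(w) for _ in range(2)],
         "v": [np.zeros_like(w) for _ in range(2)]}
for _ in range(100):
    grad = 2 * w
    w = k_adam_step(w, grad, state, k=2)
```

Note that chaining the stages this way makes each later stage adapt to the already-normalized signal of the stage before it, which mirrors the "meta-adaptive" effect the abstract attributes to scale-invariant architectural components.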