Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in both generalization and optimization. However, this advantage is not well understood theoretically. One challenge is that, although Adam with $\ell_2$ regularization intuitively optimizes the $\ell_2$-regularized loss, it is not clear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show that in the full-batch setting, if AdamW converges under any non-increasing learning rate schedule whose partial sums diverge, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor. This result builds on two observations: Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and there is a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
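
The connection to Frank-Wolfe can be made concrete with a one-line identity. It is written here for plain sign descent with decoupled weight decay rather than for full Adam, and the notation is assumed for illustration: $x_t$ is the iterate, $g_t = \nabla L(x_t)$ the full-batch gradient, $\eta_t$ the learning rate, and $\lambda > 0$ the weight decay factor.

$$
x_{t+1} = x_t - \eta_t\bigl(\mathrm{sign}(g_t) + \lambda x_t\bigr) = (1 - \eta_t\lambda)\,x_t + \eta_t\lambda\, s_t, \qquad s_t = -\tfrac{1}{\lambda}\,\mathrm{sign}(g_t) \in \arg\min_{\|s\|_\infty \le 1/\lambda} \langle g_t, s \rangle .
$$

When $\eta_t\lambda \le 1$, each step is therefore a convex combination of the current iterate and a minimizer of the linearized loss over the $\ell_\infty$ ball of radius $1/\lambda$, which is a Frank-Wolfe step with step size $\eta_t\lambda$. In particular, an iterate that starts inside the ball stays inside it.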
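
The constrained-optimization claim can be illustrated numerically on a toy problem. The sketch below is not AdamW itself: it assumes plain sign descent with decoupled weight decay (the unsmoothed idealization of Adam mentioned in the abstract), a hypothetical separable quadratic loss $L(x) = \tfrac{1}{2}\sum_i (x_i - c_i)^2$, and a schedule $\eta_t = \eta_0/\sqrt{t+1}$ that is non-increasing with divergent partial sums. The function name and all constants are illustrative choices.

```python
import math


def sign(z):
    # Returns -1, 0, or 1.
    return (z > 0) - (z < 0)


def grad(x, c):
    # Gradient of the toy loss L(x) = 0.5 * sum((x_i - c_i)^2).
    return [xi - ci for xi, ci in zip(x, c)]


def sign_descent_decoupled_wd(c, lam, eta0=0.5, steps=5000):
    """Sign descent with decoupled weight decay (an idealization, not AdamW).

    Update: x <- x - eta_t * (sign(grad) + lam * x),
    with eta_t = eta0 / sqrt(t + 1).
    Returns the final iterate and the largest l_inf norm seen along the way.
    """
    x = [0.0] * len(c)
    max_norm = 0.0
    for t in range(steps):
        eta = eta0 / math.sqrt(t + 1)
        g = grad(x, c)
        x = [xi - eta * (sign(gi) + lam * xi) for xi, gi in zip(x, g)]
        max_norm = max(max_norm, max(abs(xi) for xi in x))
    return x, max_norm


c = [3.0, 0.2, -5.0]   # unconstrained minimizer of the toy loss
lam = 0.5              # weight decay factor, so the l_inf radius is 1 / lam = 2
x_final, max_norm = sign_descent_decoupled_wd(c, lam)

# For this separable quadratic, the KKT point of the box-constrained problem
# is the coordinate-wise clipping of c onto [-1/lam, 1/lam].
radius = 1.0 / lam
kkt_point = [max(-radius, min(radius, ci)) for ci in c]

print("final iterate:", x_final)
print("clipped target:", kkt_point)
print("largest l_inf norm seen:", max_norm, "radius:", radius)
```

Under these assumptions, the coordinates whose unconstrained optimum lies outside the ball settle on its boundary, the interior coordinate approaches its unconstrained optimum, and the iterates never leave the $\ell_\infty$ ball of radius $1/\lambda$.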