In this paper, we investigate the convergence properties of a wide class of Adam-family methods for solving quadratically regularized nonsmooth nonconvex optimization problems, with a focus on training nonsmooth neural networks with weight decay. Motivated by the AdamW method, we propose a novel framework for Adam-family methods with decoupled weight decay. Within this framework, the estimators for the first-order and second-order moments of the stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of the framework. In addition, we show that the framework encompasses a wide variety of well-known Adam-family methods, and hence provides convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, we show that the framework asymptotically approximates the SGD method, which offers an explanation for the empirical observation that decoupled weight decay enhances the generalization performance of Adam-family methods. As a practical application of the framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD) and establish its convergence properties under mild conditions. Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW in both generalization performance and efficiency.
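The decoupling described above can be illustrated with a minimal sketch. It assumes the standard AdamW-style update, not the AdamD scheme proposed here, whose exact form is not stated in this abstract. The function name `decoupled_step` and all hyperparameter defaults are illustrative assumptions.

```python
import math


def decoupled_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=1e-2):
    """One AdamW-style step on lists of floats (hypothetical helper).

    x: variables, g: stochastic (sub)gradients of the unregularized loss,
    m, v: first- and second-moment estimators, t: step count starting at 1.
    Returns the updated (x, m, v) as new lists.
    """
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, g, m, v):
        # The moment estimators see only the (sub)gradient gi;
        # the weight decay term weight_decay * xi never enters them.
        mi = beta1 * mi + (1.0 - beta1) * gi
        vi = beta2 * vi + (1.0 - beta2) * gi * gi
        # Standard Adam bias correction.
        m_hat = mi / (1.0 - beta1 ** t)
        v_hat = vi / (1.0 - beta2 ** t)
        # Weight decay acts directly on the variable, outside the
        # adaptive rescaling by sqrt(v_hat).
        xi = xi - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * xi)
        new_x.append(xi)
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def abs_subgradient(xi):
    """A subgradient of the nonsmooth function |x| (0 is chosen at x = 0)."""
    return (xi > 0) - (xi < 0)


if __name__ == "__main__":
    # Toy run on f(x) = sum |x_i| with decoupled weight decay.
    x, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, 101):
        g = [float(abs_subgradient(xi)) for xi in x]
        x, m, v = decoupled_step(x, g, m, v, t, lr=0.01)
    print(x)
```

In a coupled variant, `weight_decay * xi` would be added to `gi` before the moment updates, so the decay would also be rescaled by `sqrt(v_hat)`. Keeping it outside is what "decoupled" refers to in this sketch.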