Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

In deep learning, different kinds of deep networks typically need different optimizers, which must be chosen through multiple trials, making the training process inefficient. To relieve this issue and consistently improve training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point. Adan then adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. We also prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a wide range of minibatch sizes, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan and has been used in multiple popular deep learning frameworks and projects.
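To make the NME idea concrete, the following is a minimal pure-Python sketch of one Adan-style update. The abstract does not spell out the update rule, so everything here is an assumption: the coefficient convention (each beta weights the newest term), the first-step handling (gradient difference set to zero), the decoupled weight-decay form, and the omission of bias correction. The function name `adan_step` and the `state` dictionary are hypothetical. The released code at https://github.com/sail-sg/Adan remains the authoritative implementation. The point it illustrates is that the gradient difference between consecutive steps stands in for a second gradient evaluation at the extrapolation point.

```python
import math


def adan_step(theta, grad, state, lr=1e-3, beta1=0.02, beta2=0.08,
              beta3=0.01, eps=1e-8, weight_decay=0.0):
    """One Adan-style update on a list of floats (illustrative sketch only).

    Assumed convention: each beta weights the *new* term of its moving average.
    """
    dim = len(theta)
    if not state:
        # Assumption: on the first call the previous gradient equals the
        # current one, so the gradient difference starts at zero.
        state.update(m=[0.0] * dim, v=[0.0] * dim, n=[0.0] * dim,
                     prev_grad=list(grad))
    m, v, n, prev = state["m"], state["v"], state["n"], state["prev_grad"]

    new_theta = []
    for i in range(dim):
        diff = grad[i] - prev[i]                     # g_k - g_{k-1}
        m[i] = (1 - beta1) * m[i] + beta1 * grad[i]  # first moment of gradients
        v[i] = (1 - beta2) * v[i] + beta2 * diff     # moment of gradient differences
        u = grad[i] + (1 - beta2) * diff             # Nesterov-style gradient estimate
        n[i] = (1 - beta3) * n[i] + beta3 * u * u    # second moment of that estimate
        step = lr / (math.sqrt(n[i]) + eps)          # per-coordinate step size
        updated = theta[i] - step * (m[i] + (1 - beta2) * v[i])
        new_theta.append(updated / (1 + weight_decay * lr))  # decoupled weight decay
    state["prev_grad"] = list(grad)
    return new_theta


# Usage on a toy objective f(x) = sum(x_i^2), whose gradient is 2 * x.
params, opt_state = [1.0, -2.0], {}
for _ in range(100):
    grads = [2.0 * x for x in params]
    params = adan_step(params, grads, opt_state, lr=0.05)
```

Only one gradient is computed per step. The look-ahead information comes from `diff`, which is why this formulation avoids the extra gradient evaluation that vanilla Nesterov acceleration requires.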
