Adaptive optimizers, such as Adam, have achieved remarkable success in deep learning. A key component of these optimizers is the preconditioning matrix, which provides enhanced gradient information and regulates the step size along each gradient direction. In this paper, we propose a novel approach to designing the preconditioning matrix that uses the gradient difference between two successive steps as its diagonal elements. These diagonal elements are closely related to the Hessian and can be viewed as an approximation of the inner product between the Hessian row vectors and the difference between adjacent parameter vectors. Additionally, we introduce an auto-switching function that enables the preconditioning matrix to switch dynamically between Stochastic Gradient Descent (SGD) and the adaptive optimizer. Based on these two techniques, we develop a new optimizer named AGD that enhances generalization performance. We evaluate AGD on public datasets from Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD achieves predictive performance that is highly competitive with, or significantly better than, that of state-of-the-art (SOTA) optimizers. Furthermore, we analyze how AGD switches automatically between SGD and the adaptive optimizer, and the actual effects of this switching in various scenarios. The code is available at https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.
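The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the released AGD implementation (see the repository linked above for that). The function names `init_state` and `agd_like_step`, the hyperparameter defaults, the Adam-style bias correction, and the specific form `max(sqrt(b), delta)` used for switching are all assumptions made for illustration. The sketch accumulates the squared difference between successive (momentum-smoothed) gradients as the diagonal preconditioner. Wherever that quantity falls below a threshold `delta`, the denominator becomes the constant `delta`, so the coordinate takes an SGD-with-momentum step with effective learning rate `lr / delta`.

```python
import numpy as np


def init_state(dim):
    """Create optimizer state for a parameter vector of length `dim` (hypothetical helper)."""
    return {"t": 0, "m": np.zeros(dim), "m_hat": np.zeros(dim), "b": np.zeros(dim)}


def agd_like_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, delta=1e-5):
    """One step of a simplified gradient-difference optimizer (illustrative assumption).

    Returns the updated parameters and a boolean mask marking the coordinates
    that took an SGD-like step (constant denominator `delta`).
    """
    state["t"] += 1
    t = state["t"]

    # Exponential moving average of the gradient, with Adam-style bias correction.
    m_hat_prev = state["m_hat"]
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    m_hat = state["m"] / (1.0 - beta1 ** t)

    # Difference between successive smoothed gradients; on the first step
    # there is no previous gradient, so the gradient itself is used.
    s = m_hat if t == 1 else m_hat - m_hat_prev

    # Diagonal preconditioner: moving average of the squared gradient difference.
    state["b"] = beta2 * state["b"] + (1.0 - beta2) * s * s
    b_hat = state["b"] / (1.0 - beta2 ** t)

    # Auto-switching (assumed form): adaptive where sqrt(b_hat) > delta, SGD-like elsewhere.
    root = np.sqrt(b_hat)
    sgd_mask = root <= delta
    denom = np.maximum(root, delta)

    state["m_hat"] = m_hat
    return theta - lr * m_hat / denom, sgd_mask
```

In this sketch, a larger `delta` pushes more coordinates into the SGD-like regime, while a very small `delta` leaves almost every coordinate adaptive. The link to the Hessian follows from a first-order Taylor expansion: the difference between gradients at two nearby parameter vectors is approximately the Hessian applied to the parameter difference, so each component is approximately the inner product of one Hessian row with that difference.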