Adam is one of the most popular optimization algorithms in deep learning. However, Adam is known not to converge in theory unless its hyperparameter $\beta_2$ is chosen in a problem-dependent manner. Many attempts have been made to fix this non-convergence (e.g., AMSGrad), but they rely on the impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ for any choice of $\beta_2$ without relying on the bounded-noise assumption. ADOPT resolves the non-convergence of Adam by removing the current gradient from the second-moment estimate and by reordering the momentum update and the normalization by the second-moment estimate. We also conduct extensive numerical experiments and verify that ADOPT outperforms Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.
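The two changes described above can be illustrated with a minimal NumPy sketch of a single update step. The exact step details and hyperparameter defaults here are illustrative assumptions, not the official implementation (see the linked repository for that): the key point is that normalization uses the *previous* second-moment estimate, which excludes the current gradient, and the momentum update happens *after* normalization.

```python
import numpy as np

def adopt_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update (illustrative sketch).

    theta: parameters, grad: current gradient g_t,
    m: first-moment (momentum) estimate, v: second-moment estimate,
    t: step counter starting at 0.
    """
    if t == 0:
        # Initialize the second moment from the first gradient; no parameter update yet,
        # so the estimate used for normalization never contains the current gradient.
        return theta, m, grad ** 2
    # Normalize the current gradient by the PREVIOUS second-moment estimate v_{t-1}.
    normed = grad / np.maximum(np.sqrt(v), eps)
    # Momentum update happens AFTER normalization (in Adam it happens before).
    m = beta1 * m + (1 - beta1) * normed
    theta = theta - alpha * m
    # Only now is the second moment updated with the current gradient.
    v = beta2 * v + (1 - beta2) * grad ** 2
    return theta, m, v
```

For example, iterating this step on $f(\theta) = \theta^2$ (gradient $2\theta$) drives $\theta$ toward zero, with the first iteration spent initializing $v$.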