Adaptive first-order optimizers are fundamental tools in deep learning, although they may generalize poorly because of nonuniform gradient scaling. In this work, we propose AdamL, a novel variant of the Adam optimizer that takes loss function information into account to attain better generalization. We provide sufficient conditions that, together with the Polyak-Łojasiewicz inequality, ensure the linear convergence of AdamL. As a byproduct of our analysis, we prove similar convergence properties for the EAdam and AdaBelief optimizers. Experimental results on benchmark functions show that AdamL typically achieves either the fastest convergence or the lowest objective function value when compared to Adam, EAdam, and AdaBelief. This superior performance is confirmed on deep learning tasks such as training convolutional neural networks, generative adversarial networks built on vanilla convolutional neural networks, and long short-term memory networks. Finally, in the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stages of training.
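For readers unfamiliar with the baselines named above, the following is a minimal sketch of the update rules of Adam, EAdam, and AdaBelief as they are commonly stated, applied to a toy quadratic. A function f with minimum value f* satisfies the Polyak-Łojasiewicz inequality with constant mu > 0 if 0.5 * ||grad f(x)||^2 >= mu * (f(x) - f*) for all x; the quadratic below satisfies it with mu equal to its smallest curvature. The objective, starting point, step count, and hyperparameters are illustrative assumptions, not the benchmark setup of the paper, and AdamL itself is not shown because the abstract does not state its update rule.

```python
import numpy as np

# Hypothetical toy problem: f(x) = 0.5 * sum(a * x**2) with minimum value 0.
# It satisfies the PL inequality with mu = min(a) = 1.
CURVATURE = np.array([1.0, 10.0])
START = np.array([1.0, -2.0])


def objective(x):
    return 0.5 * np.sum(CURVATURE * x**2)


def gradient(x):
    return CURVATURE * x


def run(optimizer, steps=500, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run one optimizer on the toy objective and return the final objective value."""
    x = START.copy()
    m = np.zeros_like(x)  # first-moment estimate (shared by all three methods)
    v = np.zeros_like(x)  # second-moment estimate (differs per method)
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1 - beta1) * g
        if optimizer == "adam":
            # Exponential moving average of squared gradients.
            v = beta2 * v + (1 - beta2) * g**2
        elif optimizer == "eadam":
            # Same as Adam, but eps is accumulated into v at every step.
            v = beta2 * v + (1 - beta2) * g**2 + eps
        elif optimizer == "adabelief":
            # Tracks the squared deviation of the gradient from its moving average.
            v = beta2 * v + (1 - beta2) * (g - m) ** 2 + eps
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return objective(x)


initial_value = objective(START)
final_values = {name: run(name) for name in ("adam", "eadam", "adabelief")}
```

The three methods differ only in how the second-moment estimate `v` is formed, which is the quantity responsible for the nonuniform gradient scaling mentioned in the abstract. Comparing `final_values` against `initial_value` gives a rough sense of their behavior on this one problem and should not be read as a reproduction of the reported experiments.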