In neural network training, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization. A widely accepted explanation for this generalization deficit is that Adam tends to converge to sharp minima. To enhance its ability to find flat minima, we propose a new variant named inverse Adam (InvAdam). The key change in InvAdam is a parameter update mechanism opposite to that of Adam: it computes the element-wise product of the first-order and second-order moments, whereas Adam computes their element-wise quotient. This modification increases the step size of the parameter update where the elements of the second-order moment are large, and decreases it where they are small, which helps the parameters escape sharp minima and settle in flat ones. However, InvAdam's update mechanism may face convergence challenges. To address this, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization. Additionally, we apply diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments on image classification tasks and large language model (LLM) fine-tuning validate that DualAdam outperforms Adam and its state-of-the-art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin-lab/DualAdam.
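The contrast between the two update rules can be sketched as follows. This is a minimal illustrative sketch based only on the abstract's description: the bias-correction details and the exact form of InvAdam's multiplicative scaling are assumptions, not the paper's verbatim algorithm.

```python
import math

def step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, mode="adam"):
    """One element-wise parameter update for a scalar parameter.

    mode="adam":    standard Adam (divide by sqrt of second moment).
    mode="invadam": assumed InvAdam form (multiply by sqrt of second moment).
    """
    m = b1 * m + (1 - b1) * grad       # first-order moment estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-order moment estimate
    m_hat = m / (1 - b1 ** t)          # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    if mode == "adam":
        # Adam: step shrinks where the second moment is large
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    else:
        # InvAdam (assumed form): step grows where the second moment is
        # large, encouraging escape from sharp minima per the abstract
        theta -= lr * m_hat * math.sqrt(v_hat)
    return theta, m, v
```

With a gradient of magnitude greater than one, the InvAdam-style rule takes a larger step than Adam at the same point, consistent with the intended behavior near sharp minima where second-moment entries are large.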