In this paper, we introduce weight prediction into the AdamW optimizer to improve its convergence when training deep neural network (DNN) models. Specifically, before each mini-batch training step, we predict the future weights according to the AdamW update rule and then use the predicted weights for both the forward pass and backward propagation. As a result, the AdamW optimizer always updates the DNN parameters with gradients computed with respect to the future weights rather than the current weights, which leads to better convergence. Our proposal is simple and straightforward to implement, yet effective in boosting the convergence of DNN training. We performed extensive experimental evaluations on image classification and language modeling tasks to verify its effectiveness. The experimental results show that our proposal boosts the convergence of AdamW and achieves better accuracy than AdamW when training DNN models.
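To make the idea concrete, the following is a minimal sketch of one AdamW step with weight prediction, written in plain Python on a toy quadratic objective. The abstract does not give the prediction formula, so the sketch assumes the future weights are obtained by extrapolating the most recent AdamW update direction `s` steps ahead. The prediction horizon `s`, the helper names `adamw_direction` and `adamw_wp_step`, and the toy loss are all illustrative assumptions, not the exact method evaluated in the paper.

```python
import math


def adamw_direction(w, m, v, t, beta1, beta2, eps, weight_decay):
    """AdamW update direction per parameter, from bias-corrected moments."""
    bc1 = 1.0 - beta1 ** t
    bc2 = 1.0 - beta2 ** t
    return [
        (mi / bc1) / (math.sqrt(vi / bc2) + eps) + weight_decay * wi
        for wi, mi, vi in zip(w, m, v)
    ]


def adamw_wp_step(w, m, v, t, grad_fn, lr=1e-2, beta1=0.9, beta2=0.999,
                  eps=1e-8, weight_decay=1e-2, s=1):
    """One AdamW step whose gradient is evaluated at predicted future weights.

    Assumption: the prediction reuses the current moment estimates and
    extrapolates the AdamW update `s` steps ahead. With s=0 (or before any
    moment history exists) this reduces to a plain AdamW step.
    """
    if t > 0 and s > 0:
        d = adamw_direction(w, m, v, t, beta1, beta2, eps, weight_decay)
        w_hat = [wi - s * lr * di for wi, di in zip(w, d)]
    else:
        w_hat = list(w)  # no history yet: fall back to the current weights

    # Forward and backward pass both use the predicted weights.
    g = grad_fn(w_hat)

    # Standard AdamW moment updates, applied to the actual weights.
    t += 1
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    d = adamw_direction(w, m, v, t, beta1, beta2, eps, weight_decay)
    w = [wi - lr * di for wi, di in zip(w, d)]
    return w, m, v, t


def loss(w):
    """Toy objective: 0.5 * ||w||^2 (illustrative only)."""
    return 0.5 * sum(wi * wi for wi in w)


def grad(w):
    """Gradient of the toy objective."""
    return list(w)


w, m, v, t = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0], 0
initial_loss = loss(w)
for _ in range(200):
    w, m, v, t = adamw_wp_step(w, m, v, t, grad)
final_loss = loss(w)
```

Note that the predicted weights are used only to compute the gradient; the optimizer state and the update itself are still applied to the actual weights, which matches the description above of using predicted weights for the forward pass and backward propagation.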