A number of recent adaptive optimizers improve the generalization performance of Adam by essentially reducing the variance of the adaptive stepsizes, bringing the methods closer to SGD with momentum. Motivated by this, we suppress the range of Adam's adaptive stepsizes by exploiting layerwise gradient statistics. In particular, at each iteration we propose to perform three consecutive operations on the second momentum v_t before using it to update the DNN model: (1) down-scaling, (2) epsilon-embedding, and (3) down-translating. The resulting algorithm is referred to as SET-Adam, where SET abbreviates the three operations. The down-scaling of v_t is performed layerwise, using the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experiments show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the eight methods when training WGAN-GP models for image generation. Furthermore, SET-Adam yields higher validation accuracies than Adam and AdaBelief when training ResNet18 on ImageNet.
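The three operations on v_t can be sketched as follows. This is a minimal illustration only: the abstract does not give the exact formulas, so the particular scaling rule (cosine of the angle between a layer's subvector of v_t and the all-one vector) and the translation rule (shifting by the smallest entry) are assumptions, not the paper's definitions.

```python
import numpy as np

def set_transform(v_layers, eps=1e-8):
    """Hypothetical sketch of the three SET operations applied layerwise
    to the second momentum v_t. The exact forms of the scaling and
    translation are assumptions for illustration."""
    out = []
    for v in v_layers:  # one nonnegative subvector of v_t per layer
        ones = np.ones_like(v)
        # (1) down-scaling: shrink the layer's subvector by the cosine of
        # the angle between v and the all-one vector (assumed form)
        cos = v.sum() / (np.linalg.norm(v) * np.linalg.norm(ones))
        v_scaled = cos * v
        # (2) epsilon-embedding: fold eps into v itself rather than
        # adding it later in the denominator (assumed form)
        v_eps = v_scaled + eps
        # (3) down-translating: shift entries down by the minimum,
        # keeping eps as a positive floor (assumed form)
        out.append(v_eps - v_eps.min() + eps)
    return out
```

Because each step preserves positivity, the transformed v_t can still be used in the usual Adam update `theta -= lr * m_t / (sqrt(v_t))`.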