We present new adaptive learning rates that can be used with any momentum method. To showcase them, we develop MoMo and MoMo-Adam, which are SGD with momentum (SGDM) and Adam, respectively, combined with our new adaptive learning rates. Our MoMo methods are motivated by model-based stochastic optimization: we use momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. The model also incorporates any known lower bound of the loss function through truncation; indeed, most losses are bounded below by zero. We then approximately minimize this model at each iteration to compute the next step. For losses with unknown lower bounds, we develop new on-the-fly estimates of the lower bound for use in the model. Numerical experiments show that our MoMo methods improve over SGDM and Adam in accuracy and in robustness to hyperparameter tuning when training image classifiers on MNIST, CIFAR10, CIFAR100, and Imagenet32, DLRM on the Criteo dataset, and a transformer model on the IWSLT14 translation task.
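To make the idea of a truncated, momentum-averaged model concrete, the following is a minimal sketch in Python. It is not the reference implementation of MoMo. It assumes that the model is an exponentially weighted average of past linearizations of the batch loss, truncated at a known lower bound, and that the next iterate minimizes this model plus a proximal term scaled by a user-chosen learning rate. The function name `truncated_momentum_step`, the `state` dictionary, and the choice to seed the averages with the first sample are all hypothetical and chosen for illustration only.

```python
import numpy as np


def truncated_momentum_step(x, loss, grad, state, lr=1.0, beta=0.9, lower_bound=0.0):
    """One illustrative step on a truncated model built from momentum averages.

    Hypothetical helper for illustration; not taken from any library.
    """
    if not state:
        # Assumption: seed the averages with the first sample so that the
        # averaging weights sum to one from the start.
        state.update(d=grad.copy(), f=float(loss), gx=float(grad @ x))
    else:
        # Exponential moving averages of the gradient, the loss,
        # and the inner product <gradient, iterate>.
        state["d"] = beta * state["d"] + (1.0 - beta) * grad
        state["f"] = beta * state["f"] + (1.0 - beta) * float(loss)
        state["gx"] = beta * state["gx"] + (1.0 - beta) * float(grad @ x)

    d = state["d"]
    # Value at the current x of the averaged linearizations
    # sum_j w_j * (f_j + <g_j, x - x_j>).
    h = state["f"] + float(d @ x) - state["gx"]

    d_norm_sq = float(d @ d)
    if d_norm_sq == 0.0:
        return x.copy()

    # Minimizing max(h + <d, y - x>, lower_bound) + ||y - x||^2 / (2 * lr)
    # over y yields a step size capped by lr and by the truncation term.
    step = min(lr, max(h - lower_bound, 0.0) / d_norm_sq)
    return x - step * d


# Usage on the toy loss f(x) = 0.5 * ||x||^2, whose lower bound is zero.
x = np.array([3.0, -4.0])
state = {}
for _ in range(20):
    loss, grad = 0.5 * float(x @ x), x.copy()
    x = truncated_momentum_step(x, loss, grad, state, lr=10.0)
```

In this sketch the truncation term caps the step even when `lr` is set far too large, which is one way to read the robustness to hyperparameter tuning described above.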