MoMo: Momentum Models for Adaptive Learning Rates

We present new adaptive learning rates that can be used with any momentum method. To showcase them, we develop MoMo and MoMo-Adam, which combine SGD with momentum (SGDM) and Adam, respectively, with our new adaptive learning rates. Our MoMo methods are motivated by model-based stochastic optimization: we use momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. The model also incorporates any known lower bound of the loss function through truncation; indeed, most losses are bounded below by zero. We then approximately minimize this model at each iteration to compute the next step. For losses with unknown lower bounds, we develop new on-the-fly estimates of the lower bound that we use in our model. Numerical experiments show that our MoMo methods improve over SGDM and Adam in terms of accuracy and robustness to hyperparameter tuning when training image classifiers on MNIST, CIFAR10, CIFAR100, and Imagenet32, DLRM on the Criteo dataset, and a transformer model on the translation task IWSLT14.
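
As a rough illustration of the idea described in the abstract, the Python sketch below keeps momentum averages of the batch losses and gradients, forms a truncated linear model of the loss from them, and takes the step that minimizes this model plus a proximal term. The function name `momo_sketch`, the parameters `lr_cap`, `beta`, and `lower_bound`, the initialization of the averages, and the exact form of the averaged model are assumptions made for illustration. This is not claimed to be the authors' exact algorithm, and it omits both the on-the-fly lower-bound estimates and the Adam variant.

```python
import numpy as np


def momo_sketch(loss_and_grad, x0, steps=100, lr_cap=1.0, beta=0.9, lower_bound=0.0):
    """Illustrative momentum-model step rule (hypothetical helper, not the paper's code).

    loss_and_grad(x) must return (loss value, gradient) for the current batch.
    """
    x = np.asarray(x0, dtype=float).copy()
    f, g = loss_and_grad(x)

    # Assumed initialization: all averages start from the first sample.
    d = g.copy()        # momentum average of gradients
    f_bar = f           # momentum average of loss values
    gamma = g @ x       # momentum average of <g_j, x_j>

    for _ in range(steps):
        # Value at x of the averaged linearization: avg_j [ f_j + <g_j, x - x_j> ].
        h = f_bar + d @ x - gamma
        d_norm_sq = d @ d
        if d_norm_sq == 0.0:
            break  # zero averaged gradient: the model gives no direction

        # Minimizer of  max(averaged linear model, lower_bound) + ||y - x||^2 / (2 * lr_cap):
        # a step along -d whose length is capped by lr_cap and by the truncation.
        tau = min(lr_cap, max(h - lower_bound, 0.0) / d_norm_sq)
        x = x - tau * d

        # Sample a new loss and gradient, then refresh the momentum averages.
        f, g = loss_and_grad(x)
        d = beta * d + (1.0 - beta) * g
        f_bar = beta * f_bar + (1.0 - beta) * f
        gamma = beta * gamma + (1.0 - beta) * (g @ x)

    return x


def quadratic(x):
    """Toy deterministic 'batch' loss with lower bound zero, used only for the demo."""
    return 0.5 * float(x @ x), x.copy()


x_start = np.array([1.0, -2.0])
x_final = momo_sketch(quadratic, x_start, steps=50)
print(quadratic(x_start)[0], quadratic(x_final)[0])
```

On this toy quadratic the first step reduces to a Polyak-type step, because the averages initially coincide with the first sample. With a genuinely stochastic loss, `loss_and_grad` would draw a fresh mini-batch on every call.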

