We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, that provably estimate the distance to the solution $D$, which is needed to set the learning rate optimally. Both techniques are modifications of the D-Adaptation method for learning-rate-free learning, and both improve upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet training on the Knee MRI dataset, and RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to those of hand-tuned Adam.
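As a brief illustration of why the distance $D$ determines the optimal learning rate, consider the textbook bound for the subgradient method. This is a standard result and is not taken from the abstract itself; the symbols $f$, $G$, $\eta$, $n$, $x_0$, $x_*$, and $\bar{x}_n$ are notation assumed here for the sketch. Suppose $f$ is convex and $G$-Lipschitz, $D = \|x_0 - x_*\|$, the iterates are $x_{k+1} = x_k - \eta g_k$ with $g_k \in \partial f(x_k)$ and a constant step size $\eta$, and $\bar{x}_n$ is the average of $x_0, \dots, x_{n-1}$. Then

$$
f(\bar{x}_n) - f(x_*) \;\le\; \frac{D^2}{2\eta n} + \frac{\eta G^2}{2},
\qquad
\eta_* = \frac{D}{G\sqrt{n}}
\;\Longrightarrow\;
f(\bar{x}_n) - f(x_*) \;\le\; \frac{D G}{\sqrt{n}} .
$$

The minimizing step size $\eta_*$ depends on $D$, which is unknown in practice. Under these assumptions, a method that estimates $D$ on the fly, starting from an initial guess $d_0$, can approach the tuned rate without a hand-set learning rate.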