We consider the problem of estimating the learning rate in adaptive methods such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, that provably estimate the distance to the solution $D$, which is needed to set the learning rate optimally. Both techniques are modifications of the D-Adaptation method for learning-rate-free learning, and both improve upon its convergence rate by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet training on the Knee MRI dataset, and RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to those of hand-tuned Adam.
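To illustrate how a running estimate of $D$ can drive the step size, the following is a minimal sketch of gradient descent with a distance estimate, written in the spirit of the methods named above rather than as the authors' exact algorithm. The function name `estimate_distance`, the gradient bound `G`, the default `d0`, and the quadratic test problem are all assumptions made for illustration. The sketch relies on one standard fact: for a convex objective and iterates $x_{k+1} = x_k - \eta_k g_k$, the ratio $\sum_{i \le k} \eta_i \langle g_i, x_0 - x_i \rangle / \|x_{k+1} - x_0\|$ never exceeds $D = \|x_0 - x_*\|$, so taking a running maximum gives a nondecreasing lower bound on $D$.

```python
import math

def estimate_distance(grad, x0, d0=1e-6, G=1.0, steps=200):
    """Gradient descent whose step size is built from a running lower
    bound d on D = ||x0 - x*||. Illustrative sketch only; the names and
    defaults here are assumptions, not taken from the paper."""
    x = list(x0)
    d = d0
    weighted_sq = 0.0  # running sum of d_i^2 * ||g_i||^2
    numerator = 0.0    # running sum of eta_i * <g_i, x0 - x_i>
    history = [d]
    for _ in range(steps):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        weighted_sq += d * d * g_sq
        # Step size scaled by the current distance estimate.
        eta = d * d / math.sqrt(d * d * G * G + weighted_sq)
        numerator += eta * sum(gi * (x0i - xi) for gi, x0i, xi in zip(g, x0, x))
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        moved = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        if moved > 0.0:
            # The ratio is a lower bound on D for convex objectives,
            # so the running maximum stays at or below D.
            d = max(d, numerator / moved)
        history.append(d)
    return x, history

# Hypothetical test problem: f(x) = 0.5 * ||x - x_star||^2, so D = 5.
x_star = [3.0, -4.0]
quadratic_grad = lambda x: [xi - si for xi, si in zip(x, x_star)]
x_final, d_history = estimate_distance(quadratic_grad, [0.0, 0.0])
```

Starting from a deliberately tiny `d0`, the estimate in `d_history` only ever increases and stays at or below the true distance. This is the property that lets the step size be set without a hand-tuned learning rate.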