We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet training on the Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to those of hand-tuned Adam.
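To make the idea of estimating $D$ concrete, the following is a minimal sketch of how a lower bound on $D = \|x_0 - x_*\|$ can be maintained during subgradient descent on a convex function. It is a simplified illustration and not the exact algorithm from the paper. It relies only on convexity: since $\langle g_i, x_i - x_* \rangle \ge 0$, the ratio $\sum_i \eta_i \langle g_i, x_0 - x_i \rangle / \|x_{k+1} - x_0\|$ never exceeds $D$. The function name `estimate_distance`, the AdaGrad-style step size, and the toy objective $f(x) = \|x - c\|_1$ are assumptions chosen for this example.

```python
import math


def estimate_distance(grad, x0, G, d0=1e-6, steps=200):
    """Subgradient descent that keeps a running lower bound d on D = ||x0 - x*||.

    Illustrative sketch only: `grad` returns a subgradient of a convex function,
    `G` is an assumed bound on subgradient norms, `d0` is the initial estimate.
    """
    x = list(x0)
    d = d0
    weighted_sq = 0.0  # running sum of d_i^2 * ||g_i||^2
    numerator = 0.0    # running sum of eta_i * <g_i, x0 - x_i>
    for _ in range(steps):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break  # zero subgradient: x is a minimizer
        weighted_sq += d * d * g_sq
        # AdaGrad-style step size scaled by the current estimate d (assumed form)
        eta = d * d / math.sqrt(d * d * G * G + weighted_sq)
        numerator += eta * sum(gi * (x0i - xi) for gi, x0i, xi in zip(g, x0, x))
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        moved = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        if moved > 0.0:
            # By convexity, numerator / moved <= D, so d never overshoots D
            d = max(d, numerator / moved)
    return x, d


# Toy problem: f(x) = |x_1 - 3| + |x_2 + 4|, minimized at c = (3, -4)
c = [3.0, -4.0]
x_start = [0.0, 0.0]
true_D = 5.0  # ||x_start - c||
d_init = 1e-6


def l1_subgrad(x):
    # sign(x_i - c_i) is a valid subgradient of |x_i - c_i|
    return [float((xi > ci) - (xi < ci)) for xi, ci in zip(x, c)]


x_final, d_est = estimate_distance(l1_subgrad, x_start, G=math.sqrt(2.0), d0=d_init)
```

In this sketch the estimate `d_est` starts at the deliberately tiny `d_init`, only ever increases, and stays at or below `true_D`, so no learning rate has to be supplied by hand.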