Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our main technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). When considering only worst-case analysis, our theory predicts that the optimal choice is the linear decay schedule where the step-size is set proportional to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule outperforms all commonly used default schedules including cosine annealing. Our adaptive schedule refinement method gives further improvements.
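As a minimal sketch of the worst-case-optimal schedule described above (the function name and base learning rate are illustrative assumptions, not from the paper), the linear decay rule sets the step size proportional to 1 - t/T:

```python
def linear_decay_lr(t: int, T: int, base_lr: float = 0.1) -> float:
    """Linear decay schedule (illustrative): step size proportional to
    1 - t/T, where t is the current iteration and T is the total number
    of steps. Yields base_lr at t = 0 and anneals linearly to 0 at t = T."""
    return base_lr * (1.0 - t / T)
```

Note that this depends only on t and T; the paper's refined schedules go further by using observed gradient norms to adapt the shape per task.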