Setting the learning rate for a deep learning model is a critical part of successful training, yet choosing this hyperparameter is often done empirically with trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $η_T^\star(t)$, where $t$ is the current iterate and $T$ is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes, which we term the easy phase and the hard phase. In the easy phase, the optimal schedule is a polynomial decay $η_T^\star(t) \simeq T^{-ξ} (1-t/T)^δ$, where $ξ$ and $δ$ depend on the properties of the features and the task. In the hard phase, the optimal schedule resembles warmup-stable-decay, with an initial learning rate that is constant in $T$ and annealing performed over a fraction of training steps that vanishes as $T$ grows. We investigate joint optimization of the learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both the easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum $β(t)$, where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task, including (1) optimal constant learning rates $η_T(t) \sim T^{-ξ}$ and (2) optimal power laws $η_T(t) \sim T^{-ξ} t^{-χ}$, finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizons depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.
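The three schedule families above can be sketched numerically. This is a minimal illustration, not the paper's derivation: the exponent values `xi`, `delta`, and `chi` below are placeholder choices, not the task-dependent exponents the abstract refers to.

```python
import numpy as np

def poly_decay(t, T, xi=0.5, delta=1.0):
    # Easy-phase optimal form: eta ~ T^{-xi} * (1 - t/T)^delta.
    # Starts at T^{-xi} and anneals to zero exactly at the horizon T.
    return T**(-xi) * (1.0 - np.asarray(t, dtype=float) / T)**delta

def constant_lr(t, T, xi=0.5):
    # Benchmark (1): optimal constant learning rate eta ~ T^{-xi}.
    return T**(-xi) * np.ones_like(np.asarray(t, dtype=float))

def power_law(t, T, xi=0.5, chi=0.3):
    # Benchmark (2): optimal power law eta ~ T^{-xi} * t^{-chi}.
    return T**(-xi) * np.asarray(t, dtype=float)**(-chi)

T = 10_000
steps = np.arange(1, T + 1)
print(poly_decay(0, T))   # initial rate T^{-xi}
print(poly_decay(T, T))   # fully annealed at the horizon
print(power_law(1, T))    # power-law benchmark at its first step
```

All three share the same $T^{-ξ}$ prefactor; they differ only in how (and whether) the rate decays within a single training run of length $T$.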