Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Setting the learning rate (LR) for a deep learning model is a critical part of successful training, yet LRs are often chosen empirically by trial and error. In this work, we explore a solvable model of optimal LR schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $η_T^\star(t)$, where $t$ is the current iterate and $T$ is the training horizon. This schedule is computed both numerically, as an optimization problem, and analytically, using optimal control theory. Our analysis reveals two regimes, which we term the easy phase and the hard phase. In the easy phase, the optimal schedule is a polynomial decay $η_T^\star(t) \simeq T^{-ξ} (1-t/T)^δ$, where $ξ$ and $δ$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay, with a constant initial LR and annealing performed over a vanishing fraction of training steps. We investigate joint optimization of LR and batch size and find that batch ramps can improve the wall-clock time in the easy phase. Beyond SGD, we derive optimal schedules for the momentum parameter $β(t)$ and show that scheduling momentum improves the loss-scaling exponent in the hard phase. We compare our optimal schedule to various benchmarks, including (1) optimal constant learning rates $η_T(t) \sim T^{-ξ}$ and (2) optimal power laws $η_T(t) \sim T^{-ξ} t^{-χ}$, and find that our schedule achieves better rates than either. Our theory suggests that LR transfer across training horizons depends on the structure of the model and task. For ResNet image classification on CIFAR-5M, the learning curves exhibit hard-phase behavior, where optimal base LRs are constant under sufficient annealing. GPT-2 style transformers trained on language modeling exhibit easy-phase behavior, where optimal LRs shift even under annealing.
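To make the schedule families named in the abstract concrete, the following is a minimal Python sketch of their functional forms. It is illustrative only and is not the paper's implementation. The function names, the prefactor `eta0`, the `t + 1` shift in the power law (used to avoid a divergence at $t = 0$), and the linear shape of the final anneal in the hard-phase schedule are all assumptions made here. The abstract specifies only the scalings in $T$ and $t$, up to constants, and does not give values for $ξ$, $δ$, or $χ$.

```python
def constant_lr(t, T, xi, eta0=1.0):
    """Benchmark (1): horizon-dependent constant LR, eta ~ T^{-xi}."""
    return eta0 * T ** (-xi)


def power_law_lr(t, T, xi, chi, eta0=1.0):
    """Benchmark (2): eta ~ T^{-xi} * t^{-chi}.

    Assumption: t is shifted by 1 so the schedule is finite at t = 0.
    """
    return eta0 * T ** (-xi) * (t + 1) ** (-chi)


def poly_decay_lr(t, T, xi, delta, eta0=1.0):
    """Easy-phase form: eta ~ T^{-xi} * (1 - t/T)^delta."""
    return eta0 * T ** (-xi) * (1.0 - t / T) ** delta


def stable_decay_lr(t, T, decay_frac, eta0=1.0):
    """Hard-phase-like form: constant LR, then an anneal over the last
    `decay_frac` of training.

    Assumption: the anneal is linear to zero; the abstract only says that
    annealing occupies a vanishing fraction of the steps.
    """
    if not 0.0 < decay_frac <= 1.0:
        raise ValueError("decay_frac must be in (0, 1]")
    decay_start = (1.0 - decay_frac) * T
    if t < decay_start:
        return eta0
    return eta0 * (T - t) / (T - decay_start)


if __name__ == "__main__":
    T = 100
    # Placeholder exponents chosen for illustration, not taken from the paper.
    for t in (0, 50, 90, 99):
        print(
            t,
            constant_lr(t, T, xi=0.5),
            power_law_lr(t, T, xi=0.5, chi=0.5),
            poly_decay_lr(t, T, xi=0.5, delta=2.0),
            stable_decay_lr(t, T, decay_frac=0.1),
        )
```

In this sketch the easy-phase schedule begins decaying from the first step, and its base value shrinks with the horizon through the $T^{-ξ}$ factor. The hard-phase schedule holds a horizon-independent base LR and anneals only near the end. To mimic a vanishing annealing fraction, `decay_frac` would itself need to shrink as `T` grows.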
