We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $β>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/β$, the optimal schedule follows a power decay to zero, $η^*(z) = η_{\mathrm{peak}}(1 - z/N)^{2β- 1}$, where the peak learning rate scales as $η_{\mathrm{peak}} \eqsim N^{-ν}$ for an explicit exponent $ν= ν(s,β)$. In contrast, in the hard-task regime $s < 1 - 1/β$, the optimal LRS exhibits a warmup-stable-decay (WSD; Hu et al., 2024) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
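The two optimal schedule shapes above can be sketched numerically. The power-decay formula $η^*(z) = η_{\mathrm{peak}}(1 - z/N)^{2β-1}$ is taken directly from the abstract; the WSD shape (linear warmup, long stable plateau, short linear decay) and its warmup/decay fractions are illustrative assumptions, not the paper's exact prescription.

```python
import numpy as np

def power_decay_lrs(N, eta_peak, beta):
    """Easy-task optimal schedule: eta(z) = eta_peak * (1 - z/N)^(2*beta - 1)."""
    z = np.arange(N)
    return eta_peak * (1.0 - z / N) ** (2 * beta - 1)

def wsd_lrs(N, eta_max, warmup_frac=0.02, decay_frac=0.05):
    """Illustrative warmup-stable-decay (WSD) shape: linear warmup to eta_max,
    a long stable plateau, and a linear decay over a short final tail.
    The fractions and linear ramps are assumptions for illustration only."""
    eta = np.full(N, eta_max, dtype=float)
    warmup_end = int(warmup_frac * N)
    decay_start = int((1.0 - decay_frac) * N)
    # Linear warmup from 0 up to (just below) eta_max.
    eta[:warmup_end] = eta_max * np.linspace(0.0, 1.0, warmup_end, endpoint=False)
    # Linear decay from eta_max toward 0 over the final tail.
    tail = np.arange(N - decay_start)
    eta[decay_start:] = eta_max * (1.0 - tail / (N - decay_start))
    return eta

eta_power = power_decay_lrs(N=1000, eta_peak=0.1, beta=2.0)
eta_wsd = wsd_lrs(N=1000, eta_max=0.1)
```

In the easy-task regime the decay is spread over the whole horizon, whereas the WSD sketch keeps the learning rate at its plateau for most of training and decays only over the final tail, mirroring the vanishing decay fraction described above.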