We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $β>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/β$, the optimal schedule follows a power decay to zero, $η^*(z) = η_{\mathrm{peak}}(1 - z/N)^{2β-1}$, where the peak learning rate scales as $η_{\mathrm{peak}} \eqsim N^{-ν}$ for an explicit exponent $ν = ν(s,β)$. In contrast, in the hard-task regime $s < 1 - 1/β$, the optimal LRS exhibits a warmup-stable-decay (WSD) structure (Hu et al., 2024): it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
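For concreteness, the two optimal schedule families above admit a short sketch in Python. The power-decay rule implements the displayed formula $η^*(z) = η_{\mathrm{peak}}(1 - z/N)^{2β-1}$ exactly; the WSD sketch is only illustrative, since the abstract fixes the three-phase structure (and that the decay phase occupies a vanishing fraction of the horizon) but not the warmup or decay shapes, so the linear ramps and the fractions `warmup_frac` and `decay_frac` below are assumptions.

```python
def power_decay_lr(z: float, N: int, eta_peak: float, beta: float) -> float:
    """Easy-task regime (s >= 1 - 1/beta): power decay to zero,
    eta*(z) = eta_peak * (1 - z/N)^(2*beta - 1), as in the abstract."""
    return eta_peak * (1.0 - z / N) ** (2.0 * beta - 1.0)


def wsd_lr(z: float, N: int, eta_max: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.02) -> float:
    """Hard-task regime (s < 1 - 1/beta): warmup-stable-decay (WSD) shape.
    Holds the largest admissible rate for most of training and decays only
    near the end. The linear ramps and the fractions are illustrative
    assumptions; the abstract specifies only the three-phase structure."""
    warmup_end = warmup_frac * N
    decay_start = (1.0 - decay_frac) * N
    if z < warmup_end:
        return eta_max * z / warmup_end           # linear warmup (assumed)
    if z < decay_start:
        return eta_max                            # stable plateau at eta_max
    return eta_max * (N - z) / (N - decay_start)  # short terminal decay


# Example: tabulate both schedules over a horizon of N = 10_000 steps.
N, eta_peak, beta = 10_000, 0.1, 2.0
for z in (0, 2_500, 5_000, 9_000, 9_999):
    print(z, power_decay_lr(z, N, eta_peak, beta), wsd_lr(z, N, eta_peak))
```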