Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a suitable time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but declines sharply during the decay phase. To explain this phenomenon, we conjecture that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing the true optimization progress. The sustained high learning rate phase and the fast decay phase are therefore responsible for progress in the river and the mountain directions respectively, and both are critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses the decay phases of previous checkpoints and keeps only one main branch, in which training resumes from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run, for models ranging from 0.1B to 1.2B parameters.
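
The shape of the schedule described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `wsd_lr`, its parameters, the linear warmup, and the linear decay to `min_lr` are all hypothetical choices, since the abstract only says that the decay is rapid and does not fix its form.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay shape.

    Assumptions (not taken from the paper): linear warmup, linear decay.
    `decay_start` is the step at which a branch leaves the constant-rate
    main branch; until then the rate does not depend on the total budget.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: constant rate, shared by every compute budget.
        return peak_lr
    # Decay phase: move linearly from peak_lr to min_lr, then hold min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def branch_for_budget(budget, peak_lr, warmup_steps, decay_steps):
    """Full per-step schedule for one budget, decaying over its last steps."""
    decay_start = budget - decay_steps
    return [
        wsd_lr(s, peak_lr, warmup_steps, decay_start, decay_steps)
        for s in range(budget)
    ]


# Two budgets share the same main branch up to the earlier branch point.
short_run = branch_for_budget(100, peak_lr=1.0, warmup_steps=10, decay_steps=20)
long_run = branch_for_budget(200, peak_lr=1.0, warmup_steps=10, decay_steps=20)
```

In this sketch the first 80 steps of `short_run` and `long_run` are identical, which is the property that lets a single constant-rate run serve several compute budgets. A cosine schedule lacks it, because every step's rate depends on the total step count.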
