Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at any time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and the fast decaying phase are responsible for progress along the river and the mountain directions respectively, and both are critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run, for model sizes scaling from 0.1B to 1.2B parameters.
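The three phases of the schedule can be sketched as a piecewise function of the step count. This is a minimal illustrative sketch, not the paper's implementation: the function name, parameters, and the linear shapes of the warmup and decay ramps are assumptions; the key property is that the stable phase is constant and can run indefinitely, while a decay branch can start at any chosen step.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Illustrative Warmup-Stable-Decay (WSD) learning rate schedule.

    warmup_steps: length of the linear warmup to peak_lr.
    decay_start:  step at which a decay branch leaves the main branch;
                  it can be chosen at any time, for any compute budget.
    decay_steps:  length of the rapid decay from peak_lr down to min_lr.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: constant learning rate; the main branch of
        # iterates can in principle continue indefinitely.
        return peak_lr
    # Decay phase: rapidly decay toward min_lr, which empirically
    # produces the sharp drop in loss.
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)
```

Under this sketch, a cosine schedule would need `decay_start` and `decay_steps` fixed before training begins, whereas here the decay branch can fork from any stable-phase checkpoint.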