Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at any time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but declines sharply during the decay phase. To explain this phenomenon, we conjecture that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing the true optimization progress. Therefore, the sustained high-learning-rate phase and the fast-decaying phase drive progress in the river and mountain directions, respectively, and both are critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, resuming from a decayed checkpoint. Empirically, WSD-S outperforms WSD and Cyclic-Cosine at obtaining multiple language model checkpoints across various compute budgets in a single run, for parameter counts scaling from 0.1B to 1.2B.
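The three phases of the schedule can be sketched as a simple learning-rate function. This is a minimal illustration, not the paper's exact implementation: the linear warmup and linear decay shapes, as well as the function name and parameters (`warmup_steps`, `decay_start`, `total_steps`, `peak_lr`, `min_lr`), are assumptions for concreteness; the key property is only that the stable phase holds the learning rate constant until a branch point is chosen.

```python
def wsd_lr(step, warmup_steps, decay_start, total_steps, peak_lr, min_lr=0.0):
    """Warmup-Stable-Decay learning rate (illustrative shapes):
    linear warmup -> constant plateau -> rapid (here linear) decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: constant learning rate; the main branch can
        # continue here indefinitely, since decay_start is only fixed
        # once a compute budget is chosen.
        return peak_lr
    # Decay phase: anneal rapidly from peak_lr down to min_lr.
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + frac * (min_lr - peak_lr)
```

Because the stable phase is flat, the same main branch can be reused: branching at a later `decay_start` with a later `total_steps` yields another strong checkpoint without restarting training.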