Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning rate schedules for overparameterized linear regression, and we highlight the central role of weight averaging (also known as model merging) in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules decay polynomially with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M- and 300M-parameter language models trained at 1-32x Chinchilla scale, comparing constant and $1/\sqrt{t}$ learning rate schedules, each combined with weight averaging, against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve final loss comparable to that of cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
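To make the recipe concrete, the following is a minimal sketch of the two ingredients the abstract names: a horizon-free $1/\sqrt{t}$ step size and a running uniform (Polyak-style) average of the iterates. The toy quadratic objective, the constants, and the function names are illustrative assumptions; the abstract does not specify the exact averaging scheme or warmup used in the experiments.

```python
import numpy as np

def anytime_lr(t, base_lr=0.1):
    # Horizon-free step size eta_t = base_lr / sqrt(t).
    # Unlike a cosine schedule, it needs no total horizon T.
    return base_lr / np.sqrt(t)

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # current SGD iterate
avg_w = w.copy()         # running uniform average of the iterates
n = 1                    # number of iterates averaged so far

for t in range(1, 1001):
    grad = w + 0.1 * rng.normal(size=5)  # noisy gradient of 0.5 * ||w||^2
    w = w - anytime_lr(t) * grad
    n += 1
    avg_w += (w - avg_w) / n             # incremental mean update

# avg_w is the model that gets evaluated; training can stop at any t
# without a horizon-dependent decay phase.
```

The anytime property comes from the fact that neither `anytime_lr` nor the averaging step depends on a prespecified total step count, so the averaged model is usable at every point along the run.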