Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
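To make the three ingredients named above concrete, the following is a minimal sketch of LR re-warming, LR re-decaying, and replay. It is not the authors' implementation: the function names (`cosine_with_warmup`, `continual_lr`, `sample_batch_source`) and all numeric values in the usage note (peak and minimum LR, warmup length, replay fraction) are illustrative assumptions. Each pre-training phase restarts the schedule: the LR is linearly re-warmed from the minimum to the peak, then re-decayed along a cosine curve, while a fixed fraction of batches is drawn from the previous dataset.

```python
import math
import random


def cosine_with_warmup(step, total_steps, warmup_steps, max_lr, min_lr):
    """LR within one phase: linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear (re-)warming from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Fraction of the decay portion completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def continual_lr(global_step, phase_steps, warmup_steps, max_lr, min_lr):
    """LR across several phases; the schedule restarts at each phase boundary."""
    start = 0
    for steps in phase_steps:
        if global_step < start + steps:
            # Local step within the current phase drives re-warming and re-decay.
            return cosine_with_warmup(
                global_step - start, steps, warmup_steps, max_lr, min_lr
            )
        start += steps
    # Past the last phase: hold the minimum LR.
    return min_lr


def sample_batch_source(replay_fraction, rng):
    """Pick the dataset for the next batch: replay old data with a fixed probability."""
    return "previous" if rng.random() < replay_fraction else "new"
```

As a usage example under assumed values, `continual_lr(step, [1000, 500], 10, 3e-4, 3e-5)` returns the minimum LR at steps 0 and 1000 (the start of each phase) and the peak LR at steps 10 and 1010 (the end of each warmup). Calling `sample_batch_source(0.05, random.Random(0))` once per batch would route roughly 5% of batches to the previous dataset.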