Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
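To make the three ingredients named in the abstract concrete, the following is a minimal sketch of LR re-warming, LR re-decaying, and replay for one continual pre-training stage. It is an illustration under stated assumptions, not the authors' implementation: the linear warmup shape, the cosine decay shape, the function names `rewarmed_cosine_lr` and `sample_source`, and all numeric values in the usage note are hypothetical choices made for this example.

```python
import math
import random


def rewarmed_cosine_lr(step, stage_steps, warmup_steps, max_lr, min_lr):
    """Learning rate at `step`, counted from the start of the current stage.

    Assumed shape: linear re-warming from min_lr to max_lr, followed by a
    cosine re-decay back to min_lr over the rest of the stage.
    """
    if step < warmup_steps:
        # Re-warming: ramp the LR back up at the start of the new stage.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine decay over the remaining steps of the stage.
    progress = (step - warmup_steps) / max(1, stage_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs long
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_source(rng, replay_fraction):
    """Choose the dataset for the next example.

    With probability `replay_fraction` the example is replayed from the
    previous dataset; otherwise it is drawn from the new dataset.
    """
    return "previous" if rng.random() < replay_fraction else "new"
```

For example, calling `rewarmed_cosine_lr(step, stage_steps=1000, warmup_steps=100, max_lr=3e-4, min_lr=3e-5)` for each step of a new stage, together with `sample_source(random.Random(0), replay_fraction=0.05)` for each example, would re-warm and re-decay the LR while mixing a small share of previous data into the new stream. The appropriate replay fraction and LR bounds depend on the strength of the distribution shift and are not specified here.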