Simple and Scalable Strategies to Continually Pre-train Large Language Models

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
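The three ingredients named in the abstract (LR re-warming, LR re-decaying, and replay of previous data) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear warm-up shape, the choice to re-warm from `min_lr`, and all numeric values below are hypothetical and chosen only for illustration.

```python
import math


def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training stage (illustrative).

    Assumption: the stage starts at min_lr (where the previous cosine
    schedule ended), re-warms linearly to max_lr, then re-decays with a
    cosine curve back to min_lr over the remaining steps.
    """
    if step < warmup_steps:
        # Linear re-warming from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine re-decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def replay_batch_sizes(batch_size, replay_fraction):
    """Split a batch into (new-data samples, replayed old-data samples)."""
    n_replay = round(batch_size * replay_fraction)
    return batch_size - n_replay, n_replay


if __name__ == "__main__":
    # Hypothetical settings, not values taken from the paper.
    for step in (0, 50, 100, 5000, 10000):
        lr = rewarmed_cosine_lr(step, total_steps=10000, warmup_steps=100,
                                max_lr=3e-4, min_lr=3e-5)
        print(step, lr)
    print(replay_batch_sizes(1024, 0.05))
```

In a training loop, each new dataset would restart `step` at zero, so the schedule is re-warmed and re-decayed once per stage, while a fixed fraction of every batch is drawn from the earlier dataset.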
