Continual Pre-Training of Large Language Models: How to (re)warm your model?

Large language models (LLMs) are routinely pre-trained on billions of tokens, only for the process to be restarted from scratch once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work we examine the effect of different warmup strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
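
As a rough illustration of the "linear warmup and cosine decay" schedule mentioned in the abstract, the following minimal Python sketch computes the learning rate at a given step. The function name `lr_at_step`, the choice to start warmup from zero, and all numeric values are assumptions made for illustration; they are not the paper's hyperparameters or implementation.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate under linear warmup followed by cosine decay.

    Assumption: warmup rises linearly from 0 to max_lr over warmup_steps,
    then the rate decays along a half cosine from max_lr to min_lr.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear (re)warming phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine decay from max_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only: 1% of steps spent rewarming.
    total, warmup = 1000, 10
    for s in (0, 5, 10, 500, 1000):
        print(s, lr_at_step(s, total, warmup, max_lr=3e-4, min_lr=3e-5))
```

In a continual pre-training setting, "rewarming" would correspond to restarting this schedule at step 0 when training on the new dataset begins, with the maximum learning rate and warmup length being the quantities the abstract says are varied.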

