Continual Pre-Training of Large Language Models: How to (re)warm your model?

Large language models (LLMs) are routinely pre-trained on billions of tokens, only for the whole process to be restarted once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work we examine the effect of different warmup strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
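The linear warmup and cosine decay schedule mentioned in the abstract can be sketched as a small function. This is only an illustration of the general schedule shape, not the authors' training code: the function name `rewarmed_lr`, the starting learning rate of 0, and the numeric values in the usage lines (`max_lr=3e-4`, `min_lr=3e-5`, 1% warmup) are assumptions chosen for the example and are not taken from the abstract.

```python
import math


def rewarmed_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Hypothetical helper: warms up linearly from 0 to `max_lr` over
    `warmup_steps`, then decays along a half cosine to `min_lr` at
    `total_steps`. All argument values are illustrative assumptions.
    """
    # Warmup phase: the learning rate grows linearly with the step count.
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * step / warmup_steps

    # Decay phase: fraction of the post-warmup budget already consumed.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Half-cosine interpolation from max_lr (progress=0) to min_lr (progress=1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: a 1,000-step continued pre-training run with 1% warmup (assumed values).
schedule = [rewarmed_lr(s, 1000, 10, 3e-4, 3e-5) for s in range(1001)]
```

Varying `max_lr` and `warmup_steps` in such a function corresponds to the two knobs the abstract says were explored: the maximum learning rate and the warmup length.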
