Large language models (LLMs) are routinely pre-trained on billions of tokens, only for the process to start over once new data becomes available. A much cheaper and more efficient solution would be to enable continual pre-training of these models, i.e., updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. As a step towards efficient continual pre-training, we examine the effect of different warmup strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
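
To make the linear warmup and cosine decay schedule mentioned above concrete, the following is a minimal sketch of how such a learning rate could be computed per step. The function name `lr_at_step`, its parameters, the choice to start the warmup from zero, and the numeric values in the usage example are illustrative assumptions, not settings taken from the experiments.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Learning rate under linear warmup followed by cosine decay.

    All names and defaults here are hypothetical. The warmup is assumed to
    rise linearly from 0 to max_lr over warmup_steps; afterwards the rate
    follows a half-cosine from max_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Linear warmup phase: 0 -> max_lr.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    # Cosine decay phase: max_lr -> min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only: 1000 steps, 10% warmup.
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000, warmup_steps=100,
                            max_lr=3e-4, min_lr=3e-5))
```

In this sketch, "rewarming" a pre-trained checkpoint corresponds to restarting the schedule at step 0 on the new dataset, so that the learning rate climbs back to `max_lr` before decaying again. Varying `max_lr` and `warmup_steps` gives the kind of sweep over maximum learning rates and warmup lengths described above.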