As language models have scaled in both parameter count and pretraining dataset size, the computational cost of pretraining has become prohibitive for all but the most well-resourced teams. This growing cost makes it increasingly important to reuse a model after it has completed pretraining, allowing its abilities to improve further without training from scratch. In this work, we detail a set of guidelines covering how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. Applying these findings in a continued pretraining run on top of a well-trained 15B parameter model, we show a 9\% improvement in average model accuracy over the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point for developing language models through reuse rather than retraining.