In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language: continual pretraining (CPT) from existing pretrained LLMs, rather than training from randomly initialized parameters. Based on parallel experiments across 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) the compute-optimal data-parameter allocation for CPT differs markedly, based on our estimated scaling factors; 4) the effectiveness of transfer at scale is influenced by training duration and linguistic properties, while remaining robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope our findings provide the research community with deeper insights into the transferability of LLMs at scale.
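For context, Hoffmann et al. (2022) model the pretraining loss of a model with $N$ parameters trained on $D$ tokens as $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$. The exact form of the joint data-parameter term used in our extended law is not spelled out in this abstract; one illustrative way such a coupling could enter (a sketch for exposition, with hypothetical coefficients $C$, $\gamma$, $\delta$, not our fitted formula) is
\[
L_{\mathrm{CPT}}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}} \;+\; \frac{C}{N^{\gamma} D^{\delta}},
\]
where the additional $C\,N^{-\gamma} D^{-\delta}$ term ties model size and CPT token count together, so the estimated exponents jointly determine the compute-optimal allocation between $N$ and $D$.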