Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is a promising way to reduce the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling properties of model reuse and find that scaling efficiency diminishes in a predictable manner: the scaling exponent with respect to second-stage training tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a base model is pretrained, the less benefit additional pretraining provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
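One plausible way to write the stated dependence, as a sketch only (the symbols $D_1$ and $D_2$ for first- and second-stage token counts and the coefficients $E$, $A$, $\alpha_0$, $\beta$ are illustrative assumptions, not quantities reported in the abstract):
$$
L(D_1, D_2) \;=\; E + \frac{A}{D_2^{\,\alpha(D_1)}}, \qquad \alpha(D_1) \;=\; \alpha_0 - \beta \log D_1 ,
$$
so the second-stage scaling exponent $\alpha(D_1)$ shrinks logarithmically as the base model's pretraining budget $D_1$ grows, which is one way to capture the saturation effect described above.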