Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating these computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models such as mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs into MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, a novel interaction term between the dense and upcycled training dataset sizes limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance on scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
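To make the interaction term concrete, below is a minimal illustrative sketch of the kind of joint power law such an analysis might fit; the variables $D_{\text{dense}}$ (dense pretraining tokens) and $D_{\text{up}}$ (upcycled MoE training tokens), together with all coefficients and exponents, are hypothetical placeholders rather than the fitted law reported in the paper:

\[
\mathcal{L}(D_{\text{dense}}, D_{\text{up}}) \;\approx\; E \;+\; \frac{A}{D_{\text{dense}}^{\alpha}} \;+\; \frac{B}{D_{\text{up}}^{\beta}} \;+\; \frac{C}{D_{\text{dense}}^{\gamma}\, D_{\text{up}}^{\delta}}
\]

Under this hedged form, the cross term $C / (D_{\text{dense}}^{\gamma} D_{\text{up}}^{\delta})$ couples the two dataset sizes, so the marginal return from additional upcycled tokens depends on how much dense pretraining preceded the upcycling; this is one way such an interaction can cap the efficiency of upcycling at large budgets.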