Training LLMs is expensive, and recent evidence indicates that training all the way to convergence is inefficient. In this paper, we investigate whether a simple idea, checkpoint averaging along the trajectory of a training run, can improve the quality of models before they have converged. This approach incurs no extra cost during training or inference. Specifically, we analyze the training trajectories of Pythia LLMs with 1 to 12 billion parameters and demonstrate that, particularly during the early to mid stages of training, checkpoint averaging accelerates convergence and improves both test and zero-shot generalization. Loss spikes are a well-recognized problem in LLM training; in our analysis we encountered two such spikes in the underlying trajectories, and our averaging mitigated both. For a 6.9B-parameter LLM, for example, our early weight averaging recipe can save up to 4200 hours of GPU time, which corresponds to significant savings in cloud compute costs.
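The core operation, averaging checkpoints along a training trajectory, can be illustrated with a minimal sketch. The abstract does not specify the averaging scheme, so the code below assumes a uniform mean over the most recent `window` checkpoints. The names `Checkpoint`, `average_checkpoints`, and `SlidingWindowAverager` are hypothetical, and weights are plain Python lists rather than framework tensors to keep the example self-contained.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical representation: parameter name -> flattened list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise uniform mean of the given checkpoints.

    Assumes every checkpoint has the same parameter names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name, first in checkpoints[0].items():
        total = [0.0] * len(first)
        for ckpt in checkpoints:
            for i, w in enumerate(ckpt[name]):
                total[i] += w
        averaged[name] = [t / n for t in total]
    return averaged


class SlidingWindowAverager:
    """Keeps the last `window` checkpoints and averages them on demand.

    The sliding-window choice is an assumption made for illustration,
    not a scheme stated in the abstract.
    """

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently drops the oldest checkpoint when full.
        self.recent: Deque[Checkpoint] = deque(maxlen=window)

    def add(self, checkpoint: Checkpoint) -> Checkpoint:
        """Record a new checkpoint and return the current window average."""
        self.recent.append(checkpoint)
        return average_checkpoints(list(self.recent))
```

Because the average is computed from checkpoints that are already saved during a normal run, a sketch like this adds no training steps, and the averaged weights have the same shape as a single model, so inference cost is unchanged. This matches the claim that the approach incurs no extra cost during training or inference.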