Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens performs over 2% worse on multiple standard LLM benchmarks than its 2.3T-token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
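The parameter-sensitivity claim can be probed empirically by perturbing a pre-trained checkpoint and measuring how much the language-modeling loss degrades. Below is a minimal sketch of one such probe using isotropic Gaussian noise; it is an illustration under stated assumptions, not the paper's exact protocol, and the checkpoint identifier, noise scale `sigma`, and helper name `perturbed_loss_gap` are placeholders.

```python
# Sketch: compare how sensitive different pre-training checkpoints are to
# parameter perturbations by adding Gaussian noise and measuring the loss gap.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perturbed_loss_gap(model_name: str, text: str,
                       sigma: float = 1e-3, seed: int = 0) -> float:
    """Return the increase in LM loss after an in-place Gaussian perturbation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    batch = tok(text, return_tensors="pt")

    def lm_loss() -> float:
        # Causal-LM loss on the probe text (labels = input ids).
        with torch.no_grad():
            out = model(**batch, labels=batch["input_ids"])
        return out.loss.item()

    base = lm_loss()
    torch.manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))  # perturb every parameter
    return lm_loss() - base  # larger gap = more sensitive parameters


# Usage (checkpoint names are placeholders for earlier/later pre-training budgets):
# gap_early = perturbed_loss_gap("allenai/OLMo-1B", probe_text)
# gap_late  = perturbed_loss_gap("path/to/later-checkpoint", probe_text)
```

If the catastrophic-overtraining account is right, checkpoints from longer pre-training runs would tend to show a larger loss gap at the same noise scale.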