As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: \textit{Can we leverage existing small pretrained models to accelerate the training of larger models?} In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge at earlier steps and in earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained model (i.e., one in its late training phase). We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. Together, these mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capability and downstream task performance. Extensive experiments on 1.4B- and 7B-parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to a 1.6$\times$ speedup with nearly 5\% improvement in downstream task accuracy over standard training, even when the pretrained model has 10$\times$ fewer parameters than the target model.
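The core idea above can be sketched as a simple alignment objective: hidden states from an early layer of the large model in training are pulled toward hidden states from a late layer of the small, frozen pretrained model. This is a minimal illustrative sketch, not the paper's exact recipe; the function names, the linear projection bridging the dimension mismatch, and the choice of MSE as the alignment loss are all assumptions.

```python
import torch
import torch.nn as nn

def let_alignment_loss(student_hidden: torch.Tensor,
                       teacher_hidden: torch.Tensor,
                       proj: nn.Linear) -> torch.Tensor:
    """Hypothetical LET-style alignment term.

    student_hidden: (batch, seq, d_student) from an EARLY layer of the large model.
    teacher_hidden: (batch, seq, d_teacher) from a LATE layer of the small
                    pretrained model (kept frozen).
    proj: linear map bridging the hidden-size mismatch between the two models.
    """
    # The teacher is frozen: detach so no gradients flow into the pretrained model.
    target = proj(teacher_hidden.detach())
    return nn.functional.mse_loss(student_hidden, target)

# Toy shapes: a small teacher (d=256) guiding a larger student (d=512).
proj = nn.Linear(256, 512)
student_h = torch.randn(2, 8, 512)
teacher_h = torch.randn(2, 8, 256)
loss = let_alignment_loss(student_h, teacher_h, proj)
```

In practice such a term would be added, with some weight, to the standard next-token prediction loss during the early phase of pretraining.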