Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models; the resulting models also benefit more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to match or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
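The core data-construction step described above can be sketched in a few lines: a raw document is passed to an instruction synthesizer, and the resulting instruction-response pairs are appended to the document to form a single pre-training example. This is a minimal illustration only; `synthesize_pairs` is a hypothetical stand-in for the paper's LM-based synthesizer, and the `<|instruction|>` / `<|response|>` delimiters are assumed formatting, not the templates used in the actual work.

```python
def synthesize_pairs(raw_text: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in for the instruction synthesizer.

    In Instruction Pre-Training this would be an open-source LM
    prompted with raw_text to generate task-style pairs; here we
    return a fixed example pair for illustration.
    """
    return [("Summarize the text in one sentence.", raw_text[:60])]


def build_pretraining_example(raw_text: str,
                              pairs: list[tuple[str, str]]) -> str:
    """Concatenate a raw text with its instruction-response pairs.

    The joined string is what would be fed to the LM as one
    instruction-augmented pre-training example.
    """
    parts = [raw_text]
    for instruction, response in pairs:
        parts.append(f"<|instruction|> {instruction}\n<|response|> {response}")
    return "\n\n".join(parts)


raw = "Instruction Pre-Training augments raw corpora with synthesized tasks."
example = build_pretraining_example(raw, synthesize_pairs(raw))
```

Applied over a large corpus, this loop is what "scalably augments massive raw corpora" refers to: each document yields one augmented example, and the synthesizer (not human annotation) supplies the supervision.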