Instruction tuning improves the reasoning abilities of large language models (LLMs), and data quality and scalability are the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data points from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, the performance of MAmmoTH2-7B (Mistral) increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
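The three-stage pipeline above can be sketched as a simple data flow. This is a minimal, hypothetical illustration: in the actual method, recall is done with a learned classifier over the web corpus and both extraction and refinement are performed by LLMs, whereas here each stage is a toy stand-in (keyword filter, regex, string cleanup) chosen only to make the structure concrete. All function names, keywords, and patterns are assumptions, not the authors' implementation.

```python
import re

# Hypothetical seed keywords for recalling instruction-rich documents
# (the paper uses a trained classifier, not a keyword filter).
SEED_KEYWORDS = {"solve", "compute", "answer", "question"}

def recall(documents):
    """Stage 1: keep documents that look instruction-rich (toy keyword filter)."""
    return [d for d in documents
            if any(k in d.lower() for k in SEED_KEYWORDS)]

def extract_pairs(document):
    """Stage 2: pull out candidate instruction-response pairs (toy regex;
    the paper uses an LLM to extract pairs from raw documents)."""
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=Q:|$)", re.S)
    return [{"instruction": q.strip(), "response": a.strip()}
            for q, a in pattern.findall(document)]

def refine(pair):
    """Stage 3: clean up a pair (in the paper, an open-source LLM rewrites
    the response and fills in missing reasoning steps)."""
    pair["response"] = pair["response"].rstrip(".") + "."
    return pair

def harvest(corpus):
    """Run recall -> extract -> refine over a corpus of documents."""
    pairs = []
    for doc in recall(corpus):
        pairs.extend(refine(p) for p in extract_pairs(doc))
    return pairs

corpus = [
    "Q: Compute 2 + 3. A: The answer is 5",
    "An unrelated news article with no instructions.",
]
print(harvest(corpus))
```

The point of the sketch is the separation of concerns: recall cheaply narrows the corpus before the (expensive) extraction and refinement stages run, which is what makes harvesting at the 10-million scale tractable.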