Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data points from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B (Mistral) improves from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, which achieves state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
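The three-stage harvesting paradigm (recall, extract, refine) can be sketched as a minimal pipeline. This is an illustrative toy, not the paper's implementation: the real stages use trained recall classifiers and open-source LLMs for extraction and refinement, whereas here each stage is a hypothetical stand-in (keyword filter, regex extraction, whitespace normalization).

```python
import re

def recall_documents(corpus, keywords):
    """Stage 1 (toy): keep documents that look instruction-like.
    The paper uses a learned recall model; this keyword filter is a stand-in."""
    return [doc for doc in corpus if any(k in doc.lower() for k in keywords)]

def extract_pairs(doc):
    """Stage 2 (toy): pull question/answer pairs from a document.
    The paper extracts pairs with an open-source LLM; this regex is a stand-in."""
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=Q:|$)", re.S)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(doc)]

def refine_pair(question, answer):
    """Stage 3 (toy): clean up a pair.
    The paper rewrites/filters pairs with an LLM; here we just normalize whitespace."""
    return (" ".join(question.split()), " ".join(answer.split()))

def harvest(corpus, keywords=("q:", "a:")):
    """Run the full recall -> extract -> refine pipeline over a corpus."""
    pairs = []
    for doc in recall_documents(corpus, keywords):
        for q, a in extract_pairs(doc):
            pairs.append(refine_pair(q, a))
    return pairs

# Hypothetical mini-corpus: one irrelevant page, one page with exercises.
corpus = [
    "Blog post about travel, no exercises here.",
    "Q: What is 2 + 3?  A: 2 + 3 = 5. Q: Simplify 4/8. A: 4/8 = 1/2.",
]
print(harvest(corpus))
```

At web scale, each stand-in is replaced by a model: recall filters billions of pre-training documents down to instruction-rich ones, and the extract/refine stages turn them into the 10M-pair dataset described above.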