Instruction tuning improves the reasoning abilities of large language models (LLMs), and data quality and scalability are the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally occurring instruction-response pairs from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build the MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, the performance of MAmmoTH2-7B (Mistral) increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, which achieves state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
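To make the three stages concrete, the following is a minimal sketch of a recall, extract, refine pipeline. It is not the authors' implementation. Every name in it (`SEED_TERMS`, `QA_PATTERN`, `recall`, `extract`, `refine`, `harvest`) is hypothetical. Keyword matching stands in for the recall model, a regular expression stands in for extraction, and the `rewrite` hook marks where a call to an open-source LLM would presumably go.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    instruction: str
    response: str


# Hypothetical seed vocabulary (assumption); a real recall stage would
# more plausibly use a trained relevance classifier.
SEED_TERMS = {"solve", "prove", "calculate", "equation", "theorem"}

# Assumed surface form for naturally occurring pairs: "Question: ... Answer: ...".
QA_PATTERN = re.compile(
    r"Question:\s*(?P<q>.+?)\s*Answer:\s*(?P<a>.+?)(?=\s*Question:|\Z)",
    re.DOTALL,
)


def recall(documents, min_hits=1):
    """Stage 1: keep documents mentioning at least `min_hits` seed terms."""
    kept = []
    for doc in documents:
        words = set(re.findall(r"[a-z]+", doc.lower()))
        if len(words & SEED_TERMS) >= min_hits:
            kept.append(doc)
    return kept


def extract(document):
    """Stage 2: pull explicit question/answer spans out of one document."""
    return [
        QAPair(m.group("q").strip(), m.group("a").strip())
        for m in QA_PATTERN.finditer(document)
    ]


def refine(pair, rewrite=None):
    """Stage 3: normalize whitespace, then optionally rewrite the response.

    `rewrite` is a placeholder callable (instruction, response) -> response;
    here it only marks where an LLM-based refinement step would plug in.
    """
    instruction = " ".join(pair.instruction.split())
    response = " ".join(pair.response.split())
    if rewrite is not None:
        response = rewrite(instruction, response)
    return QAPair(instruction, response)


def harvest(documents, rewrite=None):
    """Run the three stages in order and return the refined pairs."""
    pairs = []
    for doc in recall(documents):
        for pair in extract(doc):
            pairs.append(refine(pair, rewrite))
    return pairs
```

Calling `harvest(docs)` on a list of raw page strings returns a list of `QAPair` objects. Passing a `rewrite` callable is how a model-backed refinement step could be swapped in without changing the other two stages.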