Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems because of the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. We then fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (GSM8K and MATH) demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%, respectively. In particular, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models in different sizes, and the training code for public use.
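The bootstrapping step described in the abstract can be illustrated with a minimal sketch. It is not the MetaMath implementation: the `QAPair` type, the `bootstrap_questions` function, and the toy string rewriters are hypothetical names introduced here, and the sketch assumes that each rewrite preserves the original answer, which need not hold for every kind of rewriting.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str  # "original" for seed items, otherwise the rewriter's name


# Hypothetical rewriter type: maps a seed question to a rewritten question.
Rewriter = Callable[[str], str]


def bootstrap_questions(
    seeds: Iterable[QAPair], rewriters: Dict[str, Rewriter]
) -> List[QAPair]:
    """Return the seed pairs plus one rewritten pair per (seed, rewriter).

    Assumption: a rewrite changes only the wording, so the seed answer is reused.
    """
    augmented: List[QAPair] = []
    for seed in seeds:
        augmented.append(seed)
        for name, rewrite in rewriters.items():
            new_question = rewrite(seed.question)
            # Skip rewrites that leave the question unchanged (no new data).
            if new_question != seed.question:
                augmented.append(QAPair(new_question, seed.answer, name))
    return augmented


# Toy rewriters standing in for whatever procedure produces the real rewrites.
toy_rewriters: Dict[str, Rewriter] = {
    "restate": lambda q: "Restated: " + q,
    "rephrase": lambda q: q.replace("How many apples", "What number of apples"),
}

seed_pairs = [
    QAPair(
        question="How many apples does Tom have if he buys 2 and then 3 more?",
        answer="5",
        source="original",
    )
]

augmented_pairs = bootstrap_questions(seed_pairs, toy_rewriters)
for pair in augmented_pairs:
    print(pair.source, "|", pair.question, "|", pair.answer)
```

In this sketch the augmented list plays the role of the bootstrapped dataset; a fine-tuning step would then consume it as ordinary question-answer training pairs.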