Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (\eg, LLaMA-2) are still far from satisfactory at solving mathematical problems because of the complex reasoning procedures involved. To bridge this gap, we propose \emph{MetaMath}, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives without extra knowledge, which results in a new dataset called {MetaMathQA}. We then fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (\ie, GSM8K and MATH) demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves $66.4\%$ on GSM8K and $19.4\%$ on MATH, exceeding the state-of-the-art models of the same size by $11.5\%$ and $8.7\%$, respectively. In particular, {MetaMath-70B} achieves an accuracy of $82.3\%$ on {GSM8K}, slightly better than {GPT-3.5-Turbo}. We release the {MetaMathQA} dataset, the {MetaMath} models in different sizes, and the training code for public use.