Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method for creating high-quality mathematical reasoning data using frontier LLMs (e.g., {\tt GPT-3.5}). Inspired by the cognitive mechanism in human mathematical learning, MathScale first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale scales effectively along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct {\sc MwpBench}, a benchmark of Math Word Problems that collects ten datasets (including GSM8K and MATH) covering K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on {\sc MwpBench}, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9\% in micro average accuracy and 43.7\% in macro average accuracy.
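
The extract-then-graph-then-generate pipeline described above can be illustrated with a minimal sketch. The abstract does not specify how the concept graph is built or sampled, so everything here is an assumption made for illustration: edges are weighted by how often two concepts co-occur in a seed question, new concept combinations are drawn by a weighted random walk, and the names `seed_concepts`, `build_concept_graph`, `sample_concepts`, and `build_prompt` are hypothetical. The concept sets are made-up stand-ins for what an LLM would extract, and no LLM is called.

```python
import random
from collections import defaultdict
from itertools import combinations

# Hypothetical extraction results: in practice an LLM would produce these
# from each seed question. The values are invented for illustration.
seed_concepts = [
    {"ratios", "unit rate", "proportional reasoning"},
    {"ratios", "percentages"},
    {"percentages", "simple interest"},
    {"unit rate", "linear equations"},
]


def build_concept_graph(concept_sets):
    """Count how often two concepts appear in the same seed question."""
    graph = defaultdict(lambda: defaultdict(int))
    for concepts in concept_sets:
        for a, b in combinations(sorted(concepts), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_concepts(graph, start, steps, rng):
    """Weighted random walk (assumed sampling scheme).

    Returns the distinct concepts visited, in order of first visit.
    """
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = graph[current]
        if not neighbors:
            break
        names = sorted(neighbors)
        weights = [neighbors[name] for name in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        if current not in path:
            path.append(current)
    return path


def build_prompt(concepts):
    """Assemble a generation prompt; the wording is purely illustrative."""
    listed = ", ".join(concepts)
    return f"Write a new math word problem that combines: {listed}."


graph = build_concept_graph(seed_concepts)
concepts = sample_concepts(graph, "ratios", steps=3, rng=random.Random(0))
prompt = build_prompt(concepts)
```

The resulting `prompt` would then be sent to a frontier LLM to produce a new question-answer pair. Sampling from the graph rather than reusing seed questions directly is what lets such a scheme reach concept combinations that no single seed question contains.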
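
The two headline metrics differ in how they weight the ten datasets. Under the usual definitions, which we assume apply here, micro average accuracy pools all examples across datasets, while macro average accuracy is the unweighted mean of the per-dataset accuracies. The sketch below shows the difference; the function name `micro_macro_accuracy`, the dataset names, and the counts are hypothetical and are not results from {\sc MwpBench}.

```python
def micro_macro_accuracy(results):
    """Compute micro and macro average accuracy.

    `results` maps a dataset name to a (num_correct, num_examples) pair.
    """
    total_correct = sum(correct for correct, _ in results.values())
    total_examples = sum(size for _, size in results.values())
    # Micro: every example counts equally, so large datasets dominate.
    micro = total_correct / total_examples
    # Macro: every dataset counts equally, regardless of its size.
    macro = sum(correct / size for correct, size in results.values()) / len(results)
    return micro, macro


# Made-up counts for two datasets of different sizes.
results = {"dataset_a": (90, 100), "dataset_b": (10, 50)}
micro, macro = micro_macro_accuracy(results)
```

With these made-up counts the two averages diverge because the larger dataset carries more weight in the micro average, which is why reporting both gives a fuller picture for a benchmark whose datasets differ in size.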