Tool-use large language models (LLMs) that integrate with an external Python interpreter have significantly improved the mathematical reasoning of open-source LLMs, while tool-free methods have taken another track: augmenting math reasoning data. However, an effective way to integrate these two research paths and combine their advantages remains to be explored. In this work, we first generate new math questions via multi-perspective data augmentation and then synthesize code-nested solutions to them. Open LLMs (i.e., Llama-2) are fine-tuned on the augmented dataset to obtain the resulting models, MuMath-Code ($\mu$-Math-Code). During inference, MuMath-Code generates code and interacts with the external Python interpreter to obtain the execution results. MuMath-Code therefore leverages the advantages of both the external tool and data augmentation. To make full use of our augmented data, we propose a two-stage training strategy: in Stage-1, we fine-tune Llama-2 on pure chain-of-thought (CoT) data to obtain an intermediate model, which is then trained on the code-nested data in Stage-2 to obtain the final MuMath-Code. Our MuMath-Code-7B achieves 83.8% on GSM8K and 52.4% on MATH, while MuMath-Code-70B achieves new state-of-the-art performance among open methods, with 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use.
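The inference procedure, in which the model generates code and reads back the interpreter's output, can be sketched as a simple loop. This is a minimal illustration and not the released implementation: the `<code>` and `<output>` delimiters, the function names `run_python`, `solve`, and `toy_generate`, and the `max_rounds` limit are all assumptions made for the example, and `toy_generate` is a hard-coded stand-in for a fine-tuned model.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the actual markup used by the models is not given here.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_python(source: str) -> str:
    """Execute a snippet and return what it printed, or the error message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # unsandboxed: acceptable for a sketch only
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question, generate, max_rounds=3):
    """Alternate between model generation and code execution."""
    context = question
    for _ in range(max_rounds):
        step = generate(context)
        context += "\n" + step
        match = CODE_RE.search(step)
        if match is None:
            # No code to run: treat this step as the final answer.
            return context
        # Feed the execution result back so the next step can use it.
        context += "\n<output>" + run_python(match.group(1)) + "</output>"
    return context


def toy_generate(context):
    """Stand-in for a fine-tuned model: writes code once, then answers."""
    if "<output>" not in context:
        return "Compute the value with code.\n<code>print(17 * 3 + 4)</code>"
    result = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
    return f"The answer is {result}."


if __name__ == "__main__":
    print(solve("What is 17 * 3 + 4?", toy_generate))
```

In a real system, `generate` would call the fine-tuned model and the snippet would run in an isolated interpreter rather than through a bare `exec`.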