Large language models (LLMs) have made impressive progress in handling simple math problems, yet they still struggle with more challenging and complex mathematical tasks. In this paper, we introduce a series of LLMs that employ Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed DotaMath. DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, obtaining fine-grained feedback from the code interpreter, and engaging in self-reflection and correction. By annotating diverse interactive tool-use trajectories and employing query evolution on the GSM8K and MATH datasets, we generate an instruction fine-tuning dataset called DotaMathQA with 574K query-response pairs. We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs across various in-domain and out-of-domain benchmarks. Notably, DotaMath-deepseek-7B achieves an outstanding 64.8% on the competition-level MATH dataset and 86.7% on GSM8K. Moreover, DotaMath-deepseek-7B maintains strong competitiveness across a series of in-domain and out-of-domain benchmarks (Avg. 80.1%). Looking forward, we anticipate that the DotaMath paradigm will open new pathways for addressing intricate mathematical problems. Our code is publicly available at https://github.com/ChengpengLi1003/DotaMath.
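The decompose-solve-reflect loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the model is replaced by a stub (`propose_code`), and all function names and the correction budget are assumptions made for the example.

```python
# Minimal sketch of a DotaMath-style loop: decompose a problem into subtasks,
# solve each with generated code, execute it, and self-correct using the
# interpreter's error message as fine-grained feedback.
import contextlib
import io


def run_snippet(code: str) -> tuple[bool, str]:
    """Execute a code snippet, returning (ok, output-or-error) as interpreter feedback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue().strip()
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"


def solve(subtasks, propose_code, max_corrections=2):
    """Solve each subtask; on failure, feed the error back for a corrected attempt."""
    results = []
    for task in subtasks:
        code = propose_code(task, feedback=None)
        for _ in range(max_corrections + 1):
            ok, out = run_snippet(code)
            if ok:
                results.append(out)
                break
            code = propose_code(task, feedback=out)  # self-correction step
        else:
            results.append(None)  # gave up after exhausting the correction budget
    return results


# Toy usage: a stub "model" whose first attempt has a typo, and which
# produces the corrected code once it sees the interpreter feedback.
def stub_model(task, feedback):
    if task == "sum 1..100":
        return "print(sum(range(1, 101)))" if feedback else "print(sum(rangee(1, 101)))"
    return "print(2**10)"


print(solve(["sum 1..100", "2^10"], stub_model))  # → ['5050', '1024']
```

The key design point the sketch captures is that the interpreter's error string flows back into the next generation call, so correction is grounded in concrete execution feedback rather than the model re-guessing blindly.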