Chain-of-thought prompting~(CoT) and tool augmentation have been validated in recent work as effective practices for improving the ability of large language models~(LLMs) to perform step-by-step reasoning on complex math-related tasks. However, most existing math reasoning datasets may not be able to fully evaluate and analyze how well LLMs manipulate tools and perform reasoning, as they may require only very few tool invocations or lack annotations for evaluating intermediate reasoning steps. To address this issue, we construct \textbf{CARP}, a new Chinese dataset consisting of 4,886 computation-intensive algebra problems with formulated annotations on intermediate steps. On CARP, we test four LLMs with CoT prompting and find that they are all prone to making mistakes in the early steps of the solution, which leads to wrong answers. Based on this finding, we propose \textbf{DELI}, a new approach that deliberates over the reasoning steps with tool interfaces. In DELI, we first initialize a step-by-step solution based on retrieved exemplars, and then iterate two deliberation procedures that check and refine the intermediate steps of the generated solution, from the perspectives of tool manipulation and natural language reasoning respectively, until the solution converges or the maximum number of turns is reached. Experimental results on CARP and six other datasets show that DELI outperforms competitive baselines in most cases and can further boost the performance of existing CoT methods. Our data and code are available at \url{https://github.com/RUCAIBox/CARP}.
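The control flow of the deliberation loop can be illustrated with a minimal sketch. It is not the authors' implementation. Every function name here (`deli_loop`, `retrieve_exemplars`, `initialize`, `tool_deliberate`, `nl_deliberate`) is a hypothetical hook standing in for an LLM or tool call. The sketch also makes two assumptions of its own: a solution is modeled as a list of step strings, and "converged" means that one full turn of both procedures leaves the solution unchanged.

```python
from typing import Callable, List

# Assumption: a solution is represented as an ordered list of step strings.
Solution = List[str]


def deli_loop(
    problem: str,
    retrieve_exemplars: Callable[[str], List[str]],        # hypothetical retriever
    initialize: Callable[[str, List[str]], Solution],      # hypothetical initial solver
    tool_deliberate: Callable[[str, Solution], Solution],  # hypothetical tool-based check/refine
    nl_deliberate: Callable[[str, Solution], Solution],    # hypothetical NL-based check/refine
    max_turns: int = 3,
) -> Solution:
    """Sketch of an iterate-until-convergence deliberation loop.

    Starts from an exemplar-guided initial solution, then alternates two
    refinement passes until a full turn changes nothing or max_turns is hit.
    """
    exemplars = retrieve_exemplars(problem)
    solution = initialize(problem, exemplars)
    for _ in range(max_turns):
        # Pass 1: check and refine steps from the tool-manipulation perspective.
        revised = tool_deliberate(problem, solution)
        # Pass 2: check and refine steps from the natural-language perspective.
        revised = nl_deliberate(problem, revised)
        if revised == solution:
            # Assumed convergence criterion: this turn left every step unchanged.
            break
        solution = revised
    return solution
```

In a toy run where the initial solution ends in an arithmetic slip and the tool pass replaces it with the computed value, the loop fixes the error in the first turn and stops in the second once nothing changes. A refinement pass that never settles is cut off after `max_turns` turns.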