Tool-augmented large language models (TALMs) are known to extend the skill set of large language models (LLMs), thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed on different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, remain open research questions. In this work, we present MATHSENSEI, a tool-augmented large language model for mathematical reasoning. We augment the model with tools for knowledge retrieval (Bing Web Search), program execution (Python), and symbolic equation solving (Wolfram-Alpha), and study the complementary benefits of these tools through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on model performance. MATHSENSEI achieves 13.5% better accuracy than gpt-3.5-turbo with chain-of-thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8k), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.