Tool-augmented large language models (TALMs) are known to extend the skill set of large language models (LLMs), thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed on various question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, remain open research questions. In this work, we present MATHSENSEI, a tool-augmented large language model for mathematical reasoning. We augment the model with tools for knowledge retrieval (Bing Web Search), program execution (Python), and symbolic equation solving (Wolfram-Alpha), and study the complementary benefits of these tools through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments with well-known tool planners to study the impact of tool sequencing on model performance. MATHSENSEI achieves 13.5% better accuracy than gpt-3.5-turbo with chain-of-thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8k), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.