Large Language Models (LLMs), combined with program-based solving techniques, are increasingly demonstrating proficiency in mathematical reasoning. For example, closed-source models such as OpenAI GPT-4 and Claude show excellent results in solving math word problems. However, progress in math word problem solving for open-source LLMs is limited, and the challenges these models face are not well studied. In this paper, we study the performance of strong open-source LLMs, including Llama 2 (7B), Code Llama (7B), and Mistral (7B), on math word problems using program-based solving techniques. Specifically, we analyze the outputs of these models when applied to math word problems and identify a category of problems that pose a significant challenge, particularly those involving quantities spanning multiple units. To address this issue, we propose a systematic approach that defines the unit of each quantity and ensures the consistency of these units during mathematical operations. We developed Unit Consistency Programs (UCPs), an annotated dataset of math word problems, each paired with a program containing unit specifications and unit verification routines. We fine-tuned Llama 2 (7B), Code Llama (7B), and Mistral (7B) models with UCPs to produce their VerityMath variants. Our findings indicate that our approach, which incorporates unit consistency, currently slightly underperforms an otherwise identical approach that does not. To understand the reasons behind this, we conduct an in-depth error analysis and suggest options for future improvements. Our code and dataset are available at https://github.com/vernontoh/VerityMath.
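To make the idea of unit specifications and unit verification routines concrete, the following is a minimal illustrative sketch (not the authors' UCP implementation; the `Quantity` class and its unit-string convention are assumptions for illustration) of how a program can carry a unit alongside each quantity and assert consistency during arithmetic:

```python
# Illustrative sketch of a unit-consistency check of the kind a Unit
# Consistency Program (UCP) might embed. This is a hypothetical helper,
# not the authors' actual code: each quantity carries a unit string,
# addition asserts matching units, and multiplication cancels a
# "numerator/denominator" unit against a matching factor.
from dataclasses import dataclass


@dataclass
class Quantity:
    value: float
    unit: str  # e.g. "dollar", "hour", "dollar/hour"

    def __add__(self, other: "Quantity") -> "Quantity":
        # Adding quantities is only valid when their units agree.
        assert self.unit == other.unit, f"unit mismatch: {self.unit} vs {other.unit}"
        return Quantity(self.value + other.value, self.unit)

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Multiplying combines units, e.g. dollar/hour * hour -> dollar.
        num, _, den = self.unit.partition("/")
        if den == other.unit:
            return Quantity(self.value * other.value, num)
        return Quantity(self.value * other.value, f"{self.unit}*{other.unit}")


# Example word problem: "A worker earns 15 dollars per hour and works 8 hours.
# How much does the worker earn?"
rate = Quantity(15.0, "dollar/hour")
hours = Quantity(8.0, "hour")
earnings = rate * hours
print(earnings.value, earnings.unit)  # 120.0 dollar
```

A program generated with such annotations would raise an assertion error if, for instance, a model attempted to add `hours` to `rate`, surfacing the multi-unit inconsistencies the paper identifies as a key failure mode.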