Large Language Models (LLMs) combined with program-based solving techniques are increasingly proficient at mathematical reasoning. However, this progress has mostly been demonstrated with closed-source models such as OpenAI-GPT4 and Claude. In this paper, we study the performance of strong open-source LLMs. Specifically, we analyze the outputs of Code Llama (7B) on math word problems. We identify a category of problems that challenge the model, particularly those involving quantities that span multiple types or units. To address this issue, we propose a systematic approach that defines a unit for each quantity and checks that these units remain consistent during mathematical operations. We develop Unit Consistency Programs (UCPs), an annotated dataset of math word problems, each paired with a program that contains unit specifications and unit verification routines. Finally, we finetune the Code Llama (7B) model with UCPs to produce VerityMath and present our preliminary findings.
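To make the idea of unit specifications and unit verification concrete, the following is a minimal Python sketch of what a program with unit checks might look like. It is an illustration under stated assumptions, not the UCP annotation format itself: the word problem, the dictionary-of-exponents unit representation, and the helper names `unit_mul`, `unit_div`, and `solve` are all hypothetical.

```python
# Hypothetical word problem (not from the dataset):
# "A worker earns 15 dollars per hour and works 8 hours. How much is earned?"

def unit_mul(a, b):
    """Multiply two units by adding exponents, e.g. (dollar/hour) * hour = dollar."""
    out = dict(a)
    for name, exp in b.items():
        out[name] = out.get(name, 0) + exp
    # Drop units whose exponent cancelled to zero.
    return {name: exp for name, exp in out.items() if exp != 0}

def unit_div(a, b):
    """Divide two units by subtracting exponents, e.g. dollar / hour = dollar/hour."""
    inverse = {name: -exp for name, exp in b.items()}
    return unit_mul(a, inverse)

def solve():
    # Unit specification: every quantity carries an explicit unit.
    hourly_rate = 15
    hourly_rate_unit = {"dollar": 1, "hour": -1}
    hours_worked = 8
    hours_worked_unit = {"hour": 1}

    earnings = hourly_rate * hours_worked
    earnings_unit = {"dollar": 1}

    # Unit verification: the declared unit of the result must match
    # the unit implied by the operation that produced it.
    assert earnings_unit == unit_mul(hourly_rate_unit, hours_worked_unit)
    return earnings

print(solve())  # prints 120
```

If the program had instead computed `hourly_rate + hours_worked` and declared the result in dollars, a corresponding check that both operands share the same unit would fail, which flags the kind of mixed-unit error the abstract describes.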