Large language models (LLMs) have advanced considerably on natural language understanding tasks, yet a gap remains before they attain true artificial general intelligence, particularly in mathematical reasoning. We postulate that the nature of LLM training, which focuses on predicting the probability of the next token, makes it difficult to model mathematical reasoning that demands exact calculation, from both data-driven and theoretical standpoints. In this paper, we address this challenge by enriching the data landscape and introducing a novel math dataset that is enhanced with the capability to use a Python code interpreter. The dataset is derived from GSM8K and MATH and has been further refined through a combination of GPT-4 annotations, human review, and self-training; errors in the original GSM8K training set have also been fixed. Additionally, we propose a tentative, easily replicable protocol for fine-tuning math-specific LLMs, which has led to a significant improvement in the performance of a 7B-parameter LLM on the GSM8K and MATH datasets. We are committed to advancing mathematical reasoning in LLMs and, to that end, we have released the model checkpoints and will make the dataset publicly available. We hope this will facilitate further research and development within the community.