Large language models (LLMs) have shown increasing competence in solving mathematical reasoning problems. However, many open-source LLMs still struggle with errors in calculation and semantic understanding during intermediate reasoning steps. In this work, we introduce Prove, a simple yet effective framework that leverages translated programs derived from natural language solutions as a verification mechanism to filter out potentially incorrect reasoning paths before aggregating final answers. Unlike vanilla majority voting, our approach filters out solutions whose corresponding program output is inconsistent with the generated solution, aggregating only those that pass verification. We conducted extensive experiments using 13 open-source LLMs from various model families and sizes, ranging from 0.5B to 13B parameters, across eight mathematical benchmarks. Our results show that Prove consistently outperforms vanilla majority voting as a heuristic for solving mathematical reasoning tasks across all model sizes and datasets, achieving improvements of up to 18% on GSM8K and 8% on MATH-500. Our code is available at https://github.com/declare-lab/prove.
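The filter-then-vote aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled solution has already been reduced to a pair of (extracted final answer, output of the translated program), and the fallback to plain majority voting when no solution passes verification is an assumption for robustness, not a detail confirmed by the abstract.

```python
from collections import Counter

def prove_aggregate(solutions):
    """Aggregate sampled solutions by verification-filtered majority vote.

    `solutions` is a list of (answer, program_output) pairs, where
    `answer` is the final answer extracted from a natural-language
    solution and `program_output` is the result of executing a program
    translated from that solution. Names and interfaces here are
    illustrative assumptions, not the paper's actual API.
    """
    # Keep only solutions whose program output agrees with the
    # solution's own final answer (the verification step).
    verified = [ans for ans, out in solutions if ans == out]
    if not verified:
        # Assumed fallback: if nothing passes verification, revert to
        # vanilla majority voting over all sampled answers.
        verified = [ans for ans, _ in solutions]
    # Majority vote over the surviving answers.
    return Counter(verified).most_common(1)[0][0]

# Example: three sampled solutions; the middle one fails verification
# because its program output ("40") disagrees with its answer ("41").
samples = [("42", "42"), ("41", "40"), ("42", "42")]
print(prove_aggregate(samples))  # prints 42
```

In contrast, vanilla majority voting would count all three samples equally; the verification step removes the inconsistent path before the vote, which is where the reported gains come from.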