Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent multi-agent LLM-based systems have improved mathematical reasoning, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches on the majority of reasoning benchmarks across multiple base LLMs. All code and implementations are released as open source.
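To make the idea of refining a program with execution feedback concrete, the following is a minimal sketch and not the IIPC implementation. It assumes a hypothetical `propose` callable standing in for the base LLM, and the names `run_program`, `refine_loop`, and `toy_propose` are illustrative only. It omits the chain-of-thought component described above and shows only the generic pattern of executing a candidate program, capturing its output or error, and passing that back for the next revision.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (problem, feedback) -> program source.
Proposer = Callable[[str, Optional[str]], str]


def run_program(source: str) -> Tuple[bool, str]:
    """Execute candidate code; return (ok, captured stdout or error text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # illustration only: no sandboxing
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue().strip()


def refine_loop(
    problem: str, propose: Proposer, max_rounds: int = 3
) -> Tuple[Optional[str], List[Tuple[str, bool, str]]]:
    """Ask for a program, run it, and feed errors back until it prints a result."""
    feedback: Optional[str] = None
    history: List[Tuple[str, bool, str]] = []
    for _ in range(max_rounds):
        program = propose(problem, feedback)
        ok, output = run_program(program)
        history.append((program, ok, output))
        if ok and output:
            return output, history
        # The execution result becomes the context for the next revision.
        feedback = output or "program printed nothing"
    return None, history


def toy_propose(problem: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: the first draft is buggy, the revision fixes it."""
    if feedback is None:
        return "print(total)"  # fails: 'total' is undefined
    return "total = sum(range(1, 11))\nprint(total)"


answer, history = refine_loop("Sum the integers from 1 to 10", toy_propose)
print(answer)        # 55
print(len(history))  # 2
```

In this toy run the first draft raises a `NameError`, the error text is returned as feedback, and the second draft succeeds. A real system would replace `toy_propose` with a model call and run the generated code in an isolated environment.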