Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, these systems still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains, combining execution feedback with the native Chain-of-Thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches on the majority of reasoning benchmarks across multiple base LLMs. All code and implementations are released as open source.
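The core loop described above can be pictured as a minimal sketch: a candidate program is proposed, executed, and refined using concrete execution feedback rather than heuristic self-evaluation. The helper names (`propose_program`, `refine_program`) are hypothetical stubs standing in for calls to a base LLM; they are not the paper's actual API.

```python
# Minimal sketch of an iterative program-refinement loop in the spirit of IIPC.
# LLM calls are stubbed out: in a real system, `propose_program` and
# `refine_program` (hypothetical helpers) would prompt a base LLM, passing
# the execution feedback back into its chain-of-thought context.

def propose_program(question: str) -> str:
    # Stub for an LLM's first draft; contains a deliberate bug (undefined name).
    return "result = 3 * four"

def refine_program(question: str, program: str, feedback: str) -> str:
    # Stub for an LLM revision conditioned on the execution feedback.
    return "result = 3 * 4"

def execute(program: str):
    # Run the candidate program, capturing either its result or its error
    # message; the error text is the concrete feedback signal for refinement.
    env = {}
    try:
        exec(program, env)
        return env.get("result"), None
    except Exception as exc:
        return None, repr(exc)

def iipc_solve(question: str, max_rounds: int = 3):
    # Iteratively improve the program until it executes cleanly
    # or the round budget is exhausted.
    program = propose_program(question)
    for _ in range(max_rounds):
        value, error = execute(program)
        if error is None:
            return value, program
        program = refine_program(question, program, error)
    return execute(program)[0], program

answer, final_program = iipc_solve("What is 3 times 4?")
print(answer)  # → 12
```

In this toy run, the first draft raises a `NameError`, that error string is fed into the refinement step, and the second draft executes to completion; the actual method orchestrates the same propose-execute-refine cycle with an LLM in the loop.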