Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages such as Lean. A formal verification system can check almost instantaneously whether a formal proof is correct, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousand) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic agentic framework that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at low token and sampling budgets. APOLLO directs a fully automated process in which the LLM generates proofs for theorems and a set of agents analyzes the proofs, fixes syntax errors, identifies mistakes in the proofs using Lean, isolates failing sub-lemmas, applies automated solvers, and invokes an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, APOLLO raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.
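The verify-isolate-repair-recombine loop described above can be sketched in miniature. This is an illustrative toy, not APOLLO's implementation: a "proof" is modeled as a list of tactic steps, the verifier flags any step containing the `sorry` placeholder, and "repair" simply substitutes an automated tactic. All helper names (`lean_check`, `repair_goal`, `apollo_repair`) are hypothetical stand-ins for the paper's agents.

```python
from typing import List, Optional, Tuple

def lean_check(steps: List[str]) -> Tuple[bool, List[int]]:
    """Toy verifier: a step fails iff it contains the `sorry` placeholder.
    Returns (all_ok, indices of failing steps)."""
    bad = [i for i, s in enumerate(steps) if "sorry" in s]
    return (len(bad) == 0, bad)

def repair_goal(step: str) -> str:
    """Toy repair agent: pretend an automated solver closes the goal."""
    return step.replace("sorry", "simp")

def apollo_repair(steps: List[str], max_attempts: int = 3) -> Optional[List[str]]:
    """Compiler-guided repair loop: verify, isolate failing sub-goals,
    repair only those, recombine, and reverify up to `max_attempts` times."""
    for _ in range(max_attempts):
        ok, failing = lean_check(steps)
        if ok:
            return steps  # verified proof found
        # Keep already-verified steps intact; repair only the failures.
        steps = [repair_goal(s) if i in failing else s
                 for i, s in enumerate(steps)]
    return None  # attempt budget exhausted

proof = ["intro h", "sorry", "exact h"]
print(apollo_repair(proof))  # ['intro h', 'simp', 'exact h']
```

The key design point the sketch mirrors is locality: instead of resampling a whole new proof on every failure, only the failing sub-lemmas are sent back for repair, which is what lets APOLLO stay within a sampling budget of a few hundred attempts rather than tens of thousands.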