Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
翻译:大型语言模型在数学推理方面日益表现出色,但其不可靠性限制了它们在数学研究中的实用性。一种缓解方法是使用大型语言模型生成形式化证明,例如在Lean语言中。我们首次大规模评估了该方法解决开放问题的能力。我们最强大的智能体在353个开放的埃尔德什问题中自主解决了9个,每个问题成本仅为数百美元,并证明了44/492个OEIS猜想,目前正应用于组合学、优化、图论、代数几何和量子光学研究。一个交替使用基于大型语言模型的生成与基于Lean的验证的基本智能体,复现了埃尔德什问题的成功,但在最难问题上成本更高。这些发现展示了AI辅助形式化证明搜索的能力,并揭示了实现这一能力的智能体设计。