We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.
翻译:我们提出了FormalProofBench,这是一个私有基准测试,旨在评估AI模型能否生成经过形式化验证的研究生级别数学证明。每个任务对应一个自然语言问题及其在Lean 4中的形式化表述,模型需输出能被Lean 4检查器接受的证明代码。FormalProofBench聚焦于高阶本科及研究生数学,题目源自资格考试与标准教材,涵盖分析、代数、概率与逻辑等方向。我们通过智能体框架评估了一系列前沿模型,发现最佳基础模型的准确率为33.5%,此后模型性能迅速下降。除准确率数据外,我们还对工具使用、失败模式、成本与延迟进行了实证分析,从而对前沿模型的形式化定理证明能力提供了全面评估。