MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.

翻译：大型语言模型（LLMs）在自动化定理证明领域取得了显著进展，然而现有形式化基准在数学覆盖范围和难度上仍存在局限。多数基准集中于更易形式化的领域（如代数和初等数论），对需要深层推理的子领域（包括数学分析）覆盖有限。为弥补这一空白，我们提出MA-ProofBench——据我们所知，这是首个专注于数学分析的形式化定理证明基准。该基准包含200个形式化定理，覆盖6大核心主题及27个子类别，涵盖测度与积分理论、复分析、泛函分析等领域。问题按难度分为两个层级：本科水平（Level I，100题）和博士资格考试水平（Level II，100题），以评估LLMs在不同数学深度下的形式推理能力。每个问题通过"人类主导、LLM辅助"的形式化构建流程生成，并经独立专家审校，确保形式化表述忠实于原始数学内容。我们在MA-ProofBench上评估了多种通用推理模型和形式化定理证明器。然而，多数模型表现不佳：即便是最佳模型GPT-5.5，在Level I上的Pass@8也仅达16%，在Level II上为5%，而多数模型在Level II上几乎为零。进一步分析识别出“Mathlib幻觉”和“不完整证明”为两类主要失败模式；对基准自然语言版本的评估则揭示了非形式化推理与形式化推理之间的明显差距。MA-ProofBench旨在作为追踪高级领域形式化数学推理进展的可靠参考。