Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
翻译:大语言模型(LLMs)在自动定理证明方面取得了显著进展,然而现有形式化基准在数学覆盖范围和难度上仍存在局限。大多数基准集中在更易于形式化的领域,如代数和初等数论,而对需要更深层推理的子领域(包括数学分析)覆盖有限。为弥补这一空白,我们引入了MA-ProofBench——据我们所知,这是首个专为数学分析设计的形式化定理证明基准。该基准包含200个形式化定理,涵盖6个核心主题和27个子类别,包括测度与积分理论、复分析和泛函分析。问题分为两个难度等级:本科水平(第一级,100道题)和博士资格考试水平(第二级,100道题),以评估LLM在不同数学深度下执行形式化推理的能力。每个问题通过一个以人类为主导、LLM辅助的形式化流程构建,随后经过独立专家评审,确保形式化表述忠实于原始数学内容。我们评估了近期一系列通用推理模型和形式化定理证明器在MA-ProofBench上的表现。然而,大多数模型表现不佳:即使是最佳模型GPT-5.5,在第一级上也仅达到16%的Pass@8,在第二级上为5%,而大多数模型在第二级上的表现接近0%。进一步分析将Mathlib幻觉和不完整证明确定为两种主导失败模式,同时,在基准的自然语言版本上的评估揭示了非形式化推理与形式化推理之间的明显差距。MA-ProofBench旨在作为追踪高级领域形式化数学推理进展的可靠参考。