Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on competition-style mathematical benchmarks such as the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each consisting of 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap relative to contest mathematics: the best model achieves only 3% accuracy (pass@64) on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize that reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.