While frontier formal mathematics systems now routinely develop repository-scale proof engineering artifacts that require multi-file coordination and semantic correctness beyond mere compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification, which validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework built on a task contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair, systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework's effectiveness through comprehensive evaluation. All code and the benchmark dataset are released open-source at https://github.com/xinhjBrant/APE-Bench.