Proof autoformalization, the task of translating natural language theorems and proofs into machine-verifiable code, is a critical step for integrating large language models into rigorous mathematical workflows. Current approaches focus on producing executable code, but they frequently fail to preserve the semantic meaning and logical structure of the original human-written argument. To address this, we introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. ProofFlow first constructs a directed acyclic graph (DAG) to map the logical dependencies between proof steps. Then, it employs a novel lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original argument. To facilitate evaluation, we present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions and logical dependency graphs, and introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity. Experimental results show our pipeline sets a new state-of-the-art for autoformalization, achieving a ProofScore of 0.545, substantially exceeding baselines like full-proof formalization (0.123), which processes the entire proof at once, and step-proof formalization (0.072), which handles each step independently. Our pipeline, benchmark, and score metric are open-sourced to encourage further progress at https://github.com/Huawei-AI4Math/ProofFlow.
翻译:证明自动形式化是将自然语言定理及证明转化为机器可验证代码的任务,是将大语言模型集成至严格数学工作流的关键环节。现有方法主要关注生成可执行代码,但往往无法保持原始人工论证的语义含义与逻辑结构。为此,我们提出ProofFlow——一种以结构保真为首要目标的新型流程框架。该方法首先构建有向无环图以映射证明步骤间的逻辑依赖关系,随后采用创新的基于引理的方法,将每个步骤系统性地形式化为中间引理,从而完整保留原始论证的逻辑结构。为便于评估,我们构建了包含184个本科难度问题的新基准数据集,其中人工标注了逐步解答与逻辑依赖图,并提出ProofScore这一综合评价指标,用于衡量句法正确性、语义忠实度与结构保真度。实验结果表明,我们的流程在自动形式化任务中取得了最先进的性能,ProofScore达到0.545,显著优于全证明形式化(0.123)与分步证明形式化(0.072)等基线方法。我们的流程框架、基准数据集与评估指标已在https://github.com/Huawei-AI4Math/ProofFlow开源,以促进该领域的进一步发展。