Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.
翻译:对证明的可靠验证仍是训练和评估人工智能系统处理复杂数学推理的瓶颈。完全形式化的证明(如Lean语言)因其无歧义性和模块化特性而易于验证。然而,绝大多数证明(尤其是人工智能系统生成的证明)均不具备这两种特性,且在诸多前沿数学场景中,将其转化为形式化语言仍具挑战性。我们提出"伪形式化"(PF)这一证明格式,既保留了形式化证明的模块化与精确性,又保持了自然语言的灵活性。伪形式化证明被分解为自包含模块,每个模块以自然语言陈述其前提、结论及证明过程。为验证常规自然语言证明的正确性,大语言模型将其转化为伪形式化格式,再对各模块进行独立验证——我们将此算法称为"分块验证"(BV)。我们在涵盖奥数级与研究级数学的两个基准测试中评估了PF+BV方法,其在错误发现的精确率与召回率上均帕累托优于LLM作为评判者的基线方法。为支持后续研究,我们发布了研究级数学证明验证基准数据集ArxivMathGradingBench。