Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods either focus on theorem-only NL-to-FL auto-formalization or on FL proof synthesis from FL theorems. In practice, auto-formalization of both theorem and proof still requires human intervention, as seen in AlphaProof's silver-medal performance at the 2024 IMO, where problem statements were manually translated before automated proof synthesis. We present ProofBridge, a unified framework for automatically translating entire NL theorems and proofs into Lean 4. At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem+proof pairs in a shared semantic space, enabling cross-modal retrieval of semantically relevant FL examples to guide translation. ProofBridge integrates retrieval-augmented fine-tuning with iterative proof repair, leveraging Lean's type checker and semantic equivalence feedback to ensure both syntactic correctness and semantic fidelity. Experiments show substantial improvements in proof auto-formalization over strong baselines (including GPT-5, Gemini-2.5, Kimina-Prover, DeepSeek-Prover), with our retrieval-augmented approach yielding significant gains in semantic correctness (SC, via proving bi-directional equivalence) and type correctness (TC, via type-checking theorem+proof) across pass@k metrics on miniF2F-Test-PF, a dataset we curated. In particular, ProofBridge improves cross-modal retrieval quality by up to 3.28x Recall@1 over all-MiniLM-L6-v2, and achieves +31.14% SC and +1.64% TC (pass@32) compared to the baseline Kimina-Prover-RL-1.7B.
翻译:将人类撰写的数学定理及证明从自然语言(NL)自动翻译成Lean 4等形式语言(FL),长久以来一直是人工智能领域面临的重大挑战。当前最先进的方法要么仅聚焦于定理的自然语言到形式语言的自动形式化,要么侧重于基于形式定理的形式证明合成。然而在实践中,定理与证明的联合自动形式化仍需人工干预——正如AlphaProof在2024年IMO中获得银牌的表现:在自动证明合成之前,问题陈述仍采用手动翻译。我们提出ProofBridge,一个用于将完整自然语言定理及证明自动翻译为Lean 4的统一框架。其核心是一个联合嵌入模型,通过将自然语言与形式语言的定理+证明对对齐到共享语义空间,实现跨模态检索语义相关的形式语言示例以指导翻译。ProofBridge集成了检索增强微调与迭代证明修复机制,利用Lean的类型检查器和语义等价反馈,确保语法正确性与语义忠实度。实验表明,在证明自动形式化任务上,我们的方法较多个强基线模型(包括GPT-5、Gemini-2.5、Kimina-Prover、DeepSeek-Prover)取得显著提升。基于我们构建的miniF2F-Test-PF数据集,检索增强方法在pass@k指标上的语义正确性(SC,通过双向等价性证明衡量)和类型正确性(TC,通过定理+证明类型检查衡量)均实现重大改进。特别地,ProofBridge在跨模态检索质量上较all-MiniLM-L6-v2实现最高3.28倍的Recall@1提升,并与基线模型Kimina-Prover-RL-1.7B相比,在pass@32指标上分别取得+31.14%的SC提升和+1.64%的TC提升。