Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved with Codex for Lean 3, reached successful formalization of only 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 yields successful translations only 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types): 86.6% of GPT-4o's errors begin with a type-check failure. In this work, we propose a method to address this issue through decoding with type-check filtering, where we first sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.