Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved with Codex for Lean 3, reached successful formalization of only 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 yields successful translations only 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types): 86.6% of GPT-4o's errors begin with a type-check failure. In this work, we propose a method to address this issue through decoding with type-check filtering, where we first sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.