Baldur: Whole-Proof Generation and Repair with Large Language Models

Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

翻译：摘要：形式化验证软件属性是一项极具价值但劳动密集型的任务。近期工作开发了使用证明助手（如Coq和Isabelle/HOL）自动化形式化验证的方法，例如通过训练模型逐步骤预测证明，并利用该模型在可能的证明空间中进行搜索。本文提出一种自动化形式化验证的新方法：我们使用在大规模自然语言文本和代码上预训练、并在证明数据上微调的大语言模型，一次性生成定理的完整证明，而非逐步骤生成。我们将该证明生成模型与微调的修复模型相结合，以修复已生成的证明，从而进一步提升证明能力。作为主要贡献，本文首次证明：（1）基于Transformer的整证生成是可行的，其效果与基于搜索的技术相当，且无需代价高昂的搜索过程。（2）为学习模型提供额外上下文（例如先前的失败证明尝试及其产生的错误信息）可实现证明修复，并进一步改进自动化证明生成。（3）我们为完全自动化的证明合成建立了新的最优水平。我们将该方法实例化于原型系统Baldur，并在包含6,336个Isabelle/HOL定理及其证明的基准测试上进行评估。除实证展示整证生成、修复及上下文增强的有效性外，我们还表明Baldur相比当前最优工具Thor，可额外自动生成8.7%定理的证明。Baldur与Thor联合可完全自动化证明65.7%的定理。本文为利用大语言模型自动化形式化验证的新研究奠定了基础。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日