Baldur: Whole-Proof Generation and Repair with Large Language Models

Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

翻译：摘要：对软件性质进行形式化验证是一项高度理想但劳动密集型的任务。近期研究开发了利用Coq和Isabelle/HOL等证明辅助工具自动化形式化验证的方法，例如通过训练模型每次预测一个证明步骤，并利用该模型在可能的证明空间中进行搜索。本文提出了一种自动化形式化验证的新方法：我们使用在大规模自然语言文本和代码上训练、并在证明数据上微调的大语言模型，一次性生成定理的完整证明，而非逐步生成。我们将此证明生成模型与微调后的修复模型相结合，以修复生成的证明，从而进一步提升证明能力。本文的主要贡献在于首次证明：（1）使用Transformer进行整证生成是可行的，且其效果与基于搜索的技术相当，同时无需昂贵的搜索过程。（2）为学习模型提供额外上下文（例如先前的失败证明尝试及随之而来的错误信息）能够实现证明修复，并进一步改善自动化证明生成。（3）我们建立了全自动证明合成的新标杆。我们将该方法实现为原型系统Baldur，并在包含6,336个Isabelle/HOL定理及其证明的基准测试上进行了评估。除实证展示整证生成、修复及附加上下文的有效性外，我们还表明Baldur相较于现有最优工具Thor，自动生成的证明定理数量额外提升了8.7%。Baldur与Thor共同协作，能够全自动证明65.7%的定理。本文为利用大语言模型自动化形式化验证的研究开辟了新方向。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日