Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
翻译:摘要:形式化验证软件属性是一项极具价值但劳动密集型的任务。近期工作开发了使用证明助手(如Coq和Isabelle/HOL)自动化形式化验证的方法,例如通过训练模型逐步骤预测证明,并利用该模型在可能的证明空间中进行搜索。本文提出一种自动化形式化验证的新方法:我们使用在大规模自然语言文本和代码上预训练、并在证明数据上微调的大语言模型,一次性生成定理的完整证明,而非逐步骤生成。我们将该证明生成模型与微调的修复模型相结合,以修复已生成的证明,从而进一步提升证明能力。作为主要贡献,本文首次证明:(1)基于Transformer的整证生成是可行的,其效果与基于搜索的技术相当,且无需代价高昂的搜索过程。(2)为学习模型提供额外上下文(例如先前的失败证明尝试及其产生的错误信息)可实现证明修复,并进一步改进自动化证明生成。(3)我们为完全自动化的证明合成建立了新的最优水平。我们将该方法实例化于原型系统Baldur,并在包含6,336个Isabelle/HOL定理及其证明的基准测试上进行评估。除实证展示整证生成、修复及上下文增强的有效性外,我们还表明Baldur相比当前最优工具Thor,可额外自动生成8.7%定理的证明。Baldur与Thor联合可完全自动化证明65.7%的定理。本文为利用大语言模型自动化形式化验证的新研究奠定了基础。