Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
翻译:摘要:对软件性质进行形式化验证是一项高度理想但劳动密集型的任务。近期研究开发了利用Coq和Isabelle/HOL等证明辅助工具自动化形式化验证的方法,例如通过训练模型每次预测一个证明步骤,并利用该模型在可能的证明空间中进行搜索。本文提出了一种自动化形式化验证的新方法:我们使用在大规模自然语言文本和代码上训练、并在证明数据上微调的大语言模型,一次性生成定理的完整证明,而非逐步生成。我们将此证明生成模型与微调后的修复模型相结合,以修复生成的证明,从而进一步提升证明能力。本文的主要贡献在于首次证明:(1)使用Transformer进行整证生成是可行的,且其效果与基于搜索的技术相当,同时无需昂贵的搜索过程。(2)为学习模型提供额外上下文(例如先前的失败证明尝试及随之而来的错误信息)能够实现证明修复,并进一步改善自动化证明生成。(3)我们建立了全自动证明合成的新标杆。我们将该方法实现为原型系统Baldur,并在包含6,336个Isabelle/HOL定理及其证明的基准测试上进行了评估。除实证展示整证生成、修复及附加上下文的有效性外,我们还表明Baldur相较于现有最优工具Thor,自动生成的证明定理数量额外提升了8.7%。Baldur与Thor共同协作,能够全自动证明65.7%的定理。本文为利用大语言模型自动化形式化验证的研究开辟了新方向。