Large stochastic language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep theirs when applied to programming, generating erroneous code. One promising avenue for keeping models honest is to have them generate code in a language that supports formal verification: if and when that is adopted, the model would provide a proof along with the code, and that proof would be automatically verified. Unfortunately, existing large language models show a severe lack of proficiency in verified programming languages. In this paper we demonstrate how to improve two pretrained models' proficiency in the Dafny verified programming language. Using 178 programming problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to generate methods in Dafny. We use three types of prompts: a direct, contextless prompt; a second one that includes a signature of the method and test cases; and a third one that decomposes the problem into steps and includes dynamically chosen similar examples. Our results show that GPT-4 performs better than PaLM-2, and that, in both models, the third prompt greatly improves the success of the generation task over the direct prompt. With the third prompt, GPT-4 was able to generate verified (and human-evaluated) Dafny methods in 58% of the cases, while the first prompt did so in only 19% of the cases. Surprisingly, the second prompt had the worst performance, at only 10%. One tangible contribution of our work is a collection of 153 MBPP problems that are implemented and formally verified in Dafny, 50 of which we wrote ourselves and 103 of which were automatically synthesized by GPT-4. Additionally, our results demonstrate that the benefits of formal program verification (proof of correctness) are now within reach...
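To make concrete what "verified" means here, the following is a minimal Dafny sketch of an MBPP-style task; the method name and specification are our own illustrative choices, not drawn from the benchmark. The `ensures` clauses state the postconditions, and the Dafny verifier discharges them automatically, so generating such a method amounts to generating both the code and a machine-checkable proof of its correctness:

```
// Illustrative example (not from MBPP): return the larger of two integers.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b   // the result is an upper bound of both inputs
  ensures m == a || m == b   // the result is one of the inputs
{
  if a >= b {
    m := a;
  } else {
    m := b;
  }
}
```

If the body failed to satisfy either postcondition, the verifier would reject the method at compile time rather than at run time, which is the property the prompting strategies above are trying to exploit.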