We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs comprising about 53,000 lines of code. The best model and prompting scheme achieve a 68% success rate, and we quantify how this rate improves when the model retries with error-message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements over this baseline as LLMs and verification techniques grow in quality.