Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux, to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem -- producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
翻译:面向证明的程序将计算内容与程序正确性的证明相结合。然而,尽管在F*等语言中使用可满足性模理论(SMT)求解器来自动化证明,编程和证明所需的人力仍然巨大。为了推动利用人工智能自动化构建面向证明的程序的研究,我们整理了一个包含60万行开源F*程序及证明的数据集,其中涵盖了从Windows、Linux到Python和Firefox等生产系统中使用的软件。该数据集包含约3.2万个顶层F*定义,每个定义代表一个类型导向的程序与证明合成问题——即根据以F*类型表示的形式规范生成定义。我们提供了一个程序片段检查器,可查询F*以验证候选方案的正确性。我们相信这是最大的SMT辅助程序证明语料库,并配有可复现的程序片段检查器。基于该数据集,我们探索了利用人工智能在F*中合成程序及其证明的方法,并取得了有前景的结果。主要发现是:微调后的较小语言模型(如Phi-2或StarCoder)的性能可与大型语言模型(如GPT-4)相媲美,且计算成本显著更低。我们还识别了多种基于类型的检索增强技术,发现它们能大幅提升性能。通过详细的错误分析和案例研究,我们揭示了模型与技术的潜在优劣,并为未来改进指明了方向。