Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.
翻译:使用证明助手(如Coq)进行形式验证是提升软件质量的有效手段,但需要大量专业知识和人力投入。机器学习技术可自动合成证明,然而现有工具仅能证明少数软件属性。我们提出Cobblestone——一种面向证明合成的分治方法。该方法利用大语言模型(LLM)生成候选证明,并基于这些证明将问题分解为更简单的子问题;通过自动识别已成功证明的子问题,并对剩余部分迭代优化,最终构建出具有正确性保证的可靠证明——即便依赖于可能产生错误结果的LLM。我们在四个开源Coq项目的基准测试中评估了Cobblestone,并有效控制了训练数据泄露风险。在完全自动化场景下,Cobblestone的性能超越了现有非LLM工具,能证明其他基于LLM的工具无法证明的众多定理,且在多个基准测试中表现更优。每次运行平均仅需1.25美元、耗时14.7分钟。此外,Cobblestone可集成用户或其他工具的外部输入,提供证明结构或相关引理。在接入此类预言机后,Cobblestone的定理证明成功率可达58%。总体而言,本研究表明通过利用局部进展和外部输入,工具能更有效地实现形式验证自动化。