Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but it demands substantial human effort in proof construction, raising a pressing need for automation. The primary obstacle is a severe lack of data: there are far fewer proofs than code snippets for Large Language Models (LLMs) to train on. In this paper, we introduce SAFE, a framework that overcomes the scarcity of human-written proofs to enable automated proof generation for Rust code. SAFE establishes a self-evolving cycle in which data synthesis and fine-tuning collaborate to enhance model capability, leveraging the definitive power of a symbolic verifier to tell correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier's feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. This yields a substantial performance gain: a 52.52% accuracy rate on a benchmark crafted by human experts, far surpassing GPT-4o's 14.39%.
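To illustrate what a machine-checkable proof for Rust code looks like, the sketch below annotates a small function for a symbolic Rust verifier in the style of Verus. The function name and its specification are illustrative assumptions, not examples taken from the paper; the point is only that the `ensures` clauses form the specification the verifier checks the body against.

```rust
// A minimal sketch, assuming Verus-style verification syntax.
use vstd::prelude::*;

verus! {

// The `ensures` clauses are the formal specification; the verifier
// discharges them symbolically at compile time, with no runtime cost.
fn max_u32(a: u32, b: u32) -> (result: u32)
    ensures
        result >= a,
        result >= b,
        result == a || result == b,
{
    if a >= b { a } else { b }
}

} // verus!
```

For richer functions, loop invariants and `proof` blocks would be needed as well; generating exactly these annotations automatically is the task the abstract describes.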