FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.

翻译：随着当前不断演进的大型语言模型（LLMs）推动程序合成的发展，形式化验证（FV）的重要性日益凸显。然而，现有的形式化验证主要依赖于符号验证器或手工编写的规则，导致其在大规模、灵活验证方面存在局限。另一方面，用于自动定理证明的形式化语言（如Isabelle）作为另一条严格的验证路径，维护着完备的规则与定理体系。本文提出FVEL，一个基于LLMs的交互式形式化验证环境。具体而言，FVEL将待验证代码转换为Isabelle表述，随后通过基于LLM的神经自动定理证明进行验证。该融合范式既利用了Isabelle中严谨且体系化的丰富规则，也便于引入和调整前沿的LLMs。为实现此目标，我们构建了大规模数据集FVELER3。该数据集包含以Isabelle形式表述的代码依赖关系与验证过程，涵盖758个理论、29,125条引理及总计200,646个具有深层依赖关系的证明步骤。我们在FVEL环境中对FVELER进行基准测试：首先使用FVELER对LLMs进行微调，随后在Code2Inv和SV-COMP数据集上进行评估。结果表明，经FVELER微调的Llama3-8B版FVEL在SV-COMP中多解决了17.39%的问题（69→81），Mistral-7B版多解决了12%的问题（75→84），且证明错误比例显著降低。项目主页：https://fveler.github.io/。