FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.

翻译：随着当前大语言模型（LLM）的不断发展，程序合成技术日益兴起，形式化验证（FV）的重要性也日益凸显。然而，现有的形式化验证主要依赖于符号验证器或手工编写的规则，导致其在大规模和灵活验证方面存在局限性。另一方面，用于自动定理证明的形式化语言（如Isabelle）作为另一种严格的验证途径，拥有完善且系统化的规则与定理库。本文提出FVEL，一个基于LLM的交互式形式化验证环境。具体而言，FVEL将待验证代码转换为Isabelle表示，然后通过LLM驱动的神经自动定理证明进行验证。该融合范式充分利用了Isabelle中严谨且丰富的结构化规则体系，同时也便于引入和调整前沿的LLM技术。为实现这一目标，我们构建了一个大规模数据集FVELER3。FVELER数据集包含了以Isabelle形式表述的代码依赖关系与验证过程，总计涵盖758个理论、29,125条引理和200,646个具有深层依赖关系的证明步骤。我们在FVEL环境中对FVELER进行基准测试：首先使用FVELER对LLM进行微调，随后在Code2Inv和SV-COMP数据集上进行评估。结果表明，经过FVELER微调的Llama3-8B模型在FVEL框架下，于SV-COMP中解决的问题数量提升了17.39%（从69题增至81题），Mistral-7B模型则提升了12%（从75题增至84题），且证明错误比例显著降低。项目主页：https://fveler.github.io/。