Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, forcing extensive manual proof effort that burdens practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification--particularly VC proving--remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), the first real-world multi-language benchmark for this task. Drawing source programs from real-world projects such as the Linux kernel and Contiki-OS, our benchmark leverages industrial verification pipelines (Why3 and Frama-C) to generate semantically equivalent test cases in the formal languages Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and an opportunity for future research.
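To make the task concrete, the following is a minimal, hypothetical sketch (not taken from NTP4VC) of the kind of goal a VC-proving benchmark entry might present once a pipeline such as Frama-C/Why3 has translated a program obligation into Lean: a bounds condition on an incremented index, dischargeable here by a decision procedure for linear arithmetic.

```lean
-- Illustrative only: a VC-style goal stating that incrementing an
-- in-bounds index i (with i < n and n ≤ 100) keeps it within the
-- bound 100. The names and bound are assumptions for illustration.
theorem vc_index_in_bounds (i n : Nat) (h1 : i < n) (h2 : n ≤ 100) :
    i + 1 ≤ 100 := by
  omega
```

Real benchmark entries derived from kernel code are typically far less amenable to a single decision procedure, which is precisely what makes automated VC proving hard.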