Program verification is vital for ensuring software reliability, especially in the context of increasingly complex systems. Loop invariants, remaining true before and after each iteration of loops, are crucial for this verification process. Traditional provers and machine learning based methods for generating loop invariants often require expert intervention or extensive labeled data, and typically only handle numerical property verification. These methods struggle with programs involving complex data structures and memory manipulations, limiting their applicability and automation capabilities. In this paper, we introduce a new benchmark named LIG-MM, specifically for programs with complex data structures and memory manipulations. We collect 312 programs from various sources, including daily programs from college homework, the international competition (SV-COMP), benchmarks from previous papers (SLING), and programs from real-world software systems (Linux Kernel, GlibC, LiteOS, and Zephyr). Based on LIG-MM, our findings indicate that previous methods, including GPT-4, fail to automate verification for these programs. Consequently, we propose a novel LLM-SE framework that coordinates LLM with symbolic execution, fine-tuned using self-supervised learning, to generate loop invariants. Experimental results on LIG-MM demonstrate that our LLM-SE outperforms state-of-the-art methods, offering a new direction toward automated program verification in real-world scenarios.
翻译:程序验证对于确保软件可靠性至关重要,尤其是在系统复杂度日益增长的背景下。循环不变式在循环的每次迭代前后均保持为真,是该验证过程的关键要素。传统定理证明器与基于机器学习的方法在生成循环不变式时,通常需要专家干预或大量标注数据,且仅能处理数值属性验证。这些方法难以应对涉及复杂数据结构与内存操作的程序,限制了其适用性与自动化能力。本文提出一个名为LIG-MM的新基准测试,专门针对包含复杂数据结构与内存操作的程序。我们从大学课程作业、国际竞赛(SV-COMP)、先前论文基准测试(SLING)以及真实软件系统(Linux内核、GlibC、LiteOS与Zephyr)等多个来源收集了312个程序。基于LIG-MM的评估表明,包括GPT-4在内的既有方法均无法实现对这些程序的自动化验证。为此,我们提出一种新型LLM-SE框架,该框架通过协调大语言模型与符号执行,并采用自监督学习进行微调,以生成循环不变式。在LIG-MM上的实验结果表明,我们的LLM-SE方法优于现有最先进技术,为真实场景下的自动化程序验证提供了新方向。