Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We investigate whether large language models (LLMs) can accelerate program verification by generating useful loop invariants. We introduce Quokka, an evaluation-oriented framework for LLM-based invariant synthesis that provides sound evaluation and achieves state-of-the-art performance. Unlike prior work that treats LLM outputs as noisy symbolic material requiring substantial post-processing, Quokka adopts a simpler and evaluation-centric design that directly validates whether each LLM-generated invariant helps prove the target assertion. We construct a benchmark of 866 instances derived from SV-COMP and evaluate 9 state-of-the-art LLMs across multiple model families. We demonstrate that supervised fine-tuning and Best-of-N sampling yield measurable improvements, and we show that Quokka consistently outperforms prior LLM-based verifiers. Our code and data are publicly available at https://github.com/Anjiang-Wei/Quokka
翻译:程序验证依赖于循环不变量,但自动发现强不变量仍是一个长期挑战。我们探究大语言模型能否通过生成有用的循环不变量来加速程序验证。本文提出Quokka,一个面向评估的基于大语言模型的不变量合成框架,该框架提供可靠的评估并达到最先进的性能。与将大语言模型输出视为需要大量后处理的噪声符号材料的先前工作不同,Quokka采用更简单且以评估为中心的设计,直接验证每个由大语言模型生成的不变量是否有助于证明目标断言。我们构建了源自SV-COMP的866个实例基准,并评估了来自多个模型家族的9个最先进的大语言模型。我们证明监督微调和Best-of-N采样能带来可衡量的改进,并表明Quokka始终优于先前基于大语言模型的验证器。我们的代码和数据已在https://github.com/Anjiang-Wei/Quokka公开。