Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) can synthesize loop invariants for a class of programs in a zero-shot setting, yet require several samples to generate a correct invariant. As a result, establishing an invariant can take a large number of calls to a program verifier. To address this issue, we propose a {\it re-ranking} approach for the results generated by LLMs. We design a ranker that, given the problem definition, can distinguish correct inductive invariants from incorrect attempts. The ranker is trained as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.
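To make the effect of re-ranking on verifier calls concrete, the following is a minimal sketch and not the ranker described above. It assumes a hypothetical scoring table standing in for a learned ranker, a stub `is_inductive` function standing in for a real program verifier, and made-up candidate invariants. It only illustrates that checking candidates in ranked order can reduce the number of verifier calls needed to reach the first correct invariant.

```python
from typing import Callable, Optional, Sequence

def verifier_calls(candidates: Sequence[str],
                   is_inductive: Callable[[str], bool]) -> Optional[int]:
    """Count verifier calls made, in order, until the first accepted candidate.

    Returns None if no candidate is accepted.
    """
    for calls, cand in enumerate(candidates, start=1):
        if is_inductive(cand):
            return calls
    return None

def rerank(candidates: Sequence[str],
           score: Callable[[str], float]) -> list:
    """Sort candidates by descending ranker score (ties keep original order)."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical LLM samples for one problem, in generation order.
candidates = ["x >= 0", "x <= n", "x == y", "0 <= x and x <= n"]

# Stub verifier (assumption): only one candidate is a correct inductive invariant.
correct = {"0 <= x and x <= n"}
def is_inductive(cand: str) -> bool:
    return cand in correct

# Hypothetical ranker scores (assumption): illustrative values, not results
# from the paper. A learned ranker would compute these from the problem
# definition and the candidate.
scores = {
    "x >= 0": 0.41,
    "x <= n": 0.38,
    "x == y": 0.05,
    "0 <= x and x <= n": 0.87,
}

calls_before = verifier_calls(candidates, is_inductive)
reranked = rerank(candidates, scores.__getitem__)
calls_after = verifier_calls(reranked, is_inductive)

print("verifier calls in generation order:", calls_before)
print("verifier calls after re-ranking:", calls_after)
```

In this toy setup the correct candidate is generated last, so the unranked order spends one verifier call per sample, while the ranked order reaches it first. How much a real ranker helps depends on how well its scores separate correct invariants from incorrect ones.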