Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a zero-shot setting, yet require several samples to generate a correct invariant. This can lead to a large number of calls to a program verifier before an invariant is established. To address this issue, we propose a {\it re-ranking} approach for the results generated by LLMs. We design a ranker that distinguishes correct inductive invariants from incorrect attempts based on the problem definition. The ranker is trained with a contrastive objective. Experimental results demonstrate that this re-ranking mechanism significantly improves the rank of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.
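To make the effect of re-ranking on verifier calls concrete, the following is a minimal sketch and not the method's actual implementation. It assumes each candidate invariant and the problem definition already have embedding vectors (the toy vectors here are invented for illustration), ranks candidates by cosine similarity to the problem, and counts how many verifier calls are needed before a candidate is accepted. The names `rerank`, `verifier_calls`, and the stand-in `verify` oracle are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(problem_vec: Vector, candidates: List[Tuple[str, Vector]]) -> List[str]:
    """Order candidate invariants by similarity to the problem, best first."""
    ordered = sorted(candidates, key=lambda c: cosine(problem_vec, c[1]), reverse=True)
    return [text for text, _ in ordered]


def verifier_calls(ordered: List[str], verify: Callable[[str], bool]) -> Optional[int]:
    """Number of verifier calls until the first accepted invariant, or None."""
    for calls, invariant in enumerate(ordered, start=1):
        if verify(invariant):
            return calls
    return None


# Toy data (assumed): embeddings are made up, and `verify` stands in for a
# real program verifier that accepts only one candidate.
problem_vec = (1.0, 0.0)
candidates = [
    ("x >= 0", (0.0, 1.0)),
    ("x <= n", (0.6, 0.8)),
    ("x + y == n", (0.9, 0.1)),
]


def verify(invariant: str) -> bool:
    return invariant == "x + y == n"


baseline_calls = verifier_calls([text for text, _ in candidates], verify)
reranked_calls = verifier_calls(rerank(problem_vec, candidates), verify)
print(baseline_calls, reranked_calls)
```

In this toy setup the correct invariant is checked last in generation order but first after re-ranking, which is the kind of saving in verifier calls that the abstract describes. A trained contrastive ranker would supply the embeddings that are hard-coded here.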