Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}.
翻译:归纳循环不变量的综合是自动化程序验证的基础。在本工作中,我们观察到大型语言模型(如gpt-3.5或gpt-4)能够在零样本设置下综合一类程序的循环不变量,但需要多个样本才能生成正确的不变量。这可能导致大量调用程序验证器来建立不变量。为解决此问题,我们提出了一种对LLM生成结果进行重排序的方法。我们设计了一个基于问题定义区分正确归纳不变量与错误尝试的排序器。该排序器被优化为对比排序器。实验结果表明,这种重排序机制显著提升了正确不变量在生成候选中的排序位次,从而显著减少了对验证器的调用次数。本文的源代码和实验数据可从\url{https://github.com/microsoft/NeuralInvariantRanker}获取。