The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to any number of undesirable outcomes. In this paper, we lay out a vision for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which reduces correctness checking to the more accessible problem of consistency checking. At the core of Clover lies a checker that performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its feasibility on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset, (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for incorrect ones (no false positives).
翻译:大规模语言模型在代码生成中的使用是软件开发中迅速增长的趋势。然而,若无有效方法确保生成代码的正确性,这一趋势可能导致诸多不良后果。本文提出应对此挑战的愿景:Clover范式(闭环可验证代码生成),其将正确性检查简化为更易处理的一致性检查问题。Clover的核心是一个在代码、文档字符串和形式标注之间执行一致性检查的检查器。该检查器通过形式验证工具与大规模语言模型的新型集成实现。我们提供理论分析支持Clover在一致性检查中应有效的论点,并通过手工设计的CloverBench数据集(包含教科书难度级别的Dafny程序标注)实证研究其可行性。实验结果表明:在该数据集上,(i) 大语言模型能较成功地自动生成形式规约;(ii) 我们的一致性检查器对正确实例实现了有前景的接受率(高达87%),同时保持对错误实例的零容忍(无假阳性)。