The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to any number of undesirable outcomes. In this paper, we lay out a vision for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which reduces correctness checking to the more accessible problem of consistency checking. At the core of Clover lies a checker that performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its feasibility on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset, (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for incorrect ones (no false positives).
翻译:利用大型语言模型进行代码生成是软件开发中迅速增长的趋势。然而,若缺乏确保生成代码正确性的有效方法,这一趋势可能导致诸多不良后果。本文提出了应对这一挑战的愿景:Clover范式(闭环可验证代码生成),该范式将正确性检查简化为更易处理的代码一致性验证问题。Clover的核心是一个检查器,用于执行代码、文档字符串与形式化注解之间的一致性验证。该检查器通过形式化验证工具与大型语言模型的新颖融合实现。我们通过理论分析论证了Clover在一致性检查方面的有效性,并在手工构建的数据集(CloverBench)上实证研究了其可行性,该数据集包含教科书难度级别的带注解Dafny程序。实验结果表明:(i)对于该数据集,大型语言模型能够较成功地自动生成形式化规范;(ii)我们的一致性检查器对正确样本达到了较高的接受率(最高87%),同时对错误样本保持零容忍(无误报)。