Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs

LLM-based assistants, such as GitHub Copilot and ChatGPT, can generate code that fulfills a programming task described in natural language, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can undermine their trust in LLM-based assistants and the perceived reliability of those assistants. Moreover, users must invest significant effort to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach uses targeted Verification Questions (VQs) to identify potential bugs within the initial code. These VQs target specific nodes within the Abstract Syntax Tree (AST) of the initial code that can trigger particular types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods by decreasing the number of targeted errors in the code by 21% to 62% and improving the number of executable code instances to 13%.
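The pipeline described above (walk the AST of the initial code, emit questions for nodes that may trigger a bug pattern, then re-prompt with the questions and the code) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `generate_vqs` and `build_reprompt`, the two targeted patterns (possibly undefined names and possibly non-existent attributes), and the wording of the questions and of the re-prompt are hypothetical and are not taken from the paper. The LLM call itself is omitted. The sketch requires Python 3.9 or later for `ast.unparse`.

```python
import ast
import builtins


def generate_vqs(code: str) -> list[str]:
    """Return hypothetical verification questions for risky AST nodes."""
    tree = ast.parse(code)

    # First pass: collect names that the code itself defines or imports.
    # This is a coarse, scope-insensitive approximation.
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)

    # Second pass: emit one question per node matching an assumed bug pattern.
    questions = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
                and node.id not in defined):
            questions.append(
                f"Line {node.lineno}: is the name '{node.id}' "
                f"defined or imported before it is used?"
            )
        elif isinstance(node, ast.Attribute) and isinstance(node.ctx, ast.Load):
            questions.append(
                f"Line {node.lineno}: does the object '{ast.unparse(node.value)}' "
                f"really have an attribute '{node.attr}'?"
            )
    # Drop duplicate questions while preserving their order.
    return list(dict.fromkeys(questions))


def build_reprompt(code: str, questions: list[str]) -> str:
    """Combine the initial code and the questions into a repair prompt
    (hypothetical wording)."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Review the following code against each verification question "
        "and return a corrected version.\n\n"
        f"Code:\n{code}\n\nVerification questions:\n{numbered}\n"
    )


if __name__ == "__main__":
    initial_code = "def area(shape):\n    return math.pi * shape.radius ** 2\n"
    print(build_reprompt(initial_code, generate_vqs(initial_code)))
```

For the sample function, the sketch flags the use of `math` without an import and asks whether `math.pi` and `shape.radius` exist. Because the questions are derived from the AST alone, no test cases and no execution are needed, which matches the setting the abstract describes.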
