Large Language Models (LLMs) are among the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work has highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, they perform very poorly at detecting either kind of issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools.
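The generate-evaluate-repair workflow described above can be sketched as a simple feedback loop. This is a minimal illustration, not the paper's implementation: `generate`, `run_tests`, and `run_static_analysis` are hypothetical placeholders standing in for an LLM call, a ground-truth test harness, and a static analyzer, respectively.

```python
# Hypothetical sketch of the evaluate-and-repair feedback loop.
# All three helper functions are placeholders, not real APIs.

def generate(prompt: str) -> str:
    # Placeholder for an LLM call that returns C source code for the task.
    return "int add(int a, int b) { return a + b; }"

def run_tests(code: str) -> list:
    # Placeholder for compiling the code and running ground-truth tests;
    # returns descriptions of failed tests (empty list = all passed).
    return []

def run_static_analysis(code: str) -> list:
    # Placeholder for a static analysis tool; returns potential
    # safety vulnerabilities flagged in the code.
    return []

def repair_loop(task: str, max_rounds: int = 3) -> str:
    """Generate code, then iteratively feed test failures and
    static-analysis warnings back to the model as repair feedback."""
    code = generate(task)
    for _ in range(max_rounds):
        failures = run_tests(code)
        warnings = run_static_analysis(code)
        if not failures and not warnings:
            break  # correct and no flagged vulnerabilities
        feedback = (
            f"{task}\n\nPrevious attempt:\n{code}\n"
            f"Failed tests: {failures}\nAnalyzer warnings: {warnings}\n"
            "Please fix the code."
        )
        code = generate(feedback)
    return code

result = repair_loop("Write a C function add(int, int) returning their sum.")
```

The key design point, mirroring the paper's finding, is that the repair prompt includes the concrete reports (failed tests, analyzer warnings) rather than asking the model to find the flaws on its own.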