Generating code with an LLM, rather than writing it from scratch, has exploded in popularity. However, the security implications of LLM-generated code remain poorly understood. We conducted a study comparing the security and quality of human-written code with that of LLM-generated code across a wide range of programming tasks, including data structures, algorithms, cryptographic routines, and LeetCode problems. To assess code security we used unit testing, fuzzing, and static analysis; for code quality, we focused on complexity and size. We found that LLMs can generate incorrect code that fails to implement the required functionality, especially for more complicated tasks, and that such errors can be subtle: for the cryptographic algorithm SHA1, for example, the LLM generated an incorrect implementation that nevertheless compiles. In cases where the functionality was correct, we found that LLM-generated code is less secure, primarily due to a lack of defensive programming constructs, which invites a host of security issues such as buffer overflows or integer overflows. Fuzzing revealed that LLM-generated code is more prone to hangs and crashes than human-written code. Quality-wise, we found that LLMs generate bare-bones code that is typically more complex per line of code than human-written code. Next, we constructed a feedback loop that asked the LLM to re-generate the code and eliminate the issues found (e.g., malloc overflows, out-of-bounds array indexing, null dereferences). We found that the LLM fails to eliminate such issues consistently: while it succeeded in some cases, we found instances where the re-generated, supposedly more secure code contains new issues; we also found that, upon prompting, the LLM can introduce issues into files that were issue-free before prompting.