Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, because they are trained on unsanitized data from open-source repositories such as GitHub, they risk inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, a comprehensive assessment of their security properties is still lacking. In this work, we present a comprehensive study that precisely evaluates and enhances the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset covering 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models on two crucial tasks, code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair, producing vulnerable code. In response, we propose strategies that leverage vulnerability-aware information and insecure-code explanations to mitigate these security weaknesses. Furthermore, our findings show that certain vulnerability types are particularly challenging for models, limiting their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring improved methods for training and utilizing LLMs and thereby leading to safer and more trustworthy model deployment.