Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, because they are trained on unsanitized data from open-source repositories such as GitHub, they risk inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, a comprehensive treatment of their security properties is still missing. In this work, we present a comprehensive study that precisely evaluates and enhances the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset covering 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models on two crucial tasks, code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair, producing vulnerable code. In response, we propose strategies that leverage vulnerability-aware information and explanations of insecure code to mitigate these security weaknesses. Furthermore, our findings show that certain vulnerability types are particularly challenging for models, limiting their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring improved methods for training and utilizing LLMs and thereby leading to safer and more trustworthy model deployment.