CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances on competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have become part of the daily workflow of millions of developers. The training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. Trained on this unsanitized data, language models can learn these vulnerabilities and propagate them during code generation. While these models have been extensively assessed for their ability to produce functionally correct programs, comprehensive investigations and benchmarks addressing their security aspects are still lacking. In this work, we propose a method to systematically study the security issues of code language models and assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. The approach approximates an inversion of the black-box code generation model through few-shot prompting. We evaluate its effectiveness by examining how readily code language models generate high-risk security weaknesses. Furthermore, we use our method to establish a collection of diverse non-secure prompts for various vulnerability scenarios. This dataset forms a benchmark for evaluating and comparing the security weaknesses of code language models.
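
As a rough illustration of what approximating model inversion with few-shot prompting could look like, the sketch below shows a handful of (prompt, vulnerable code) pairs presented in reverse order, code first and prompt second, so that the model is asked to complete the prompt for a new piece of vulnerable code. This is a minimal sketch under stated assumptions, not the exact procedure of this work: the section markers, the function names `build_inversion_prompt` and `find_candidate_prompts`, and the `model` callable (any function mapping a query string to a completion string) are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical section markers; the prompt format used in this work is not specified here.
CODE_HEADER = "# --- vulnerable completion ---"
PROMPT_HEADER = "# --- prompt that led to it ---"


def build_inversion_prompt(examples: List[Tuple[str, str]], target_code: str) -> str:
    """Format few-shot (prompt, vulnerable code) pairs in reverse order
    (code first, prompt second) and leave the final prompt slot open."""
    parts = []
    for prompt, code in examples:
        parts.append(f"{CODE_HEADER}\n{code}\n{PROMPT_HEADER}\n{prompt}\n")
    # The last entry has code but no prompt: the model is expected to fill it in.
    parts.append(f"{CODE_HEADER}\n{target_code}\n{PROMPT_HEADER}\n")
    return "\n".join(parts)


def find_candidate_prompts(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
    target_code: str,
    n: int = 3,
) -> List[str]:
    """Query the black-box model n times and keep the distinct, non-empty
    completions as candidate non-secure prompts."""
    query = build_inversion_prompt(examples, target_code)
    candidates: List[str] = []
    for _ in range(n):
        candidate = model(query).strip()
        if candidate and candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

In such a setup, each candidate prompt would then be fed back to the same model, and the resulting completions checked for the targeted weakness with a separate analysis step, which this sketch leaves out.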
