Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

With the growing popularity of Large Language Models (LLMs; e.g., GitHub Copilot and ChatGPT) in software engineers' daily practice, it is important to ensure that the code these tools generate is not only functionally correct but also free of vulnerabilities. Although LLMs can make developers more productive, prior empirical studies have shown that they can generate insecure code. Two factors contribute to insecure code generation. First, the datasets used to evaluate LLMs do not adequately represent genuine, security-sensitive software engineering tasks. Instead, they are often based on competitive programming challenges or classroom-style coding tasks. In real-world applications, the generated code is integrated into larger codebases, where it can introduce security risks. There is a clear lack of benchmarks that focus on evaluating the security of generated code. Second, existing evaluation metrics focus primarily on the functional correctness of the generated code and ignore security considerations. Metrics such as pass@k estimate the probability that at least one of the top k suggestions is correct. Other popular metrics, such as BLEU, CodeBLEU, ROUGE, and METEOR, score similarity to a reference solution and likewise neglect security implications. In light of these research gaps, this paper describes SALLM, a framework for systematically benchmarking LLMs' ability to generate secure code. The framework has three major components: a novel dataset of security-centric Python prompts, an evaluation environment for testing the generated code, and novel metrics for evaluating the models' performance from the perspective of secure code generation.
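
To make the pass@k metric mentioned above concrete, the following is a minimal sketch of how it is commonly computed for a single prompt. It assumes the unbiased estimator widely used in code-generation evaluations, 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass the functional tests. The abstract does not state which estimator is intended, and the function name `pass_at_k` is chosen here for illustration only. The sketch shows that the metric depends solely on functional test outcomes, so an insecure but working sample counts as a success.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one prompt (illustrative helper, not from the paper).

    n: number of samples generated for the prompt
    c: number of those samples that pass the functional tests
    k: number of suggestions the user is assumed to inspect
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples generated, 5 pass the tests, user looks at 1 suggestion.
    print(pass_at_k(10, 5, 1))
```

Nothing in this computation inspects the generated code for vulnerabilities, which is the gap the abstract points to.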
