With the growing popularity of Large Language Models (LLMs) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers be more productive, prior empirical studies have shown that they can generate insecure code. Two factors contribute to this insecure code generation. First, existing datasets used to evaluate LLMs do not adequately represent genuine, security-sensitive software engineering tasks. Instead, they are often based on competitive programming challenges or classroom-style coding exercises, whereas in real-world applications the generated code is integrated into larger codebases, introducing potential security risks. Second, existing evaluation metrics focus primarily on the functional correctness of the generated code while ignoring security considerations. Therefore, in this paper, we describe SALLM, a framework to systematically benchmark LLMs' abilities to generate secure code. The framework has three major components: a novel dataset of security-centric Python prompts, configurable assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.