Large Language Models (LLMs) such as Codex are powerful tools for code completion and code generation because they are trained on billions of lines of code from publicly available sources. These models can also generate code snippets from Natural Language (NL) descriptions, having learned languages and programming practices from public GitHub repositories. Although LLMs promise effortless NL-driven deployment of software applications, the security of the code they generate has been neither extensively investigated nor documented. In this work, we present LLMSecEval, a dataset containing 150 NL prompts that can be used to assess the security performance of such models. The prompts are NL descriptions of code snippets prone to security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluation against code produced by LLMs. As a practical application, we show how LLMSecEval can be used to evaluate the security of snippets automatically generated from NL descriptions.
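To make the evaluation idea concrete, the following is a minimal sketch of how a prompt record and an evaluation loop might look. It is not the LLMSecEval schema or tooling: the field names (`prompt_id`, `cwe_id`, `nl_prompt`, `secure_example`), the `generate` callable standing in for an LLM, and the toy regex checks are all assumptions made for illustration. A real evaluation would replace the regex checks with a proper static analyzer.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PromptRecord:
    """One dataset entry (hypothetical field names, not the real schema)."""
    prompt_id: str
    cwe_id: str          # e.g. "CWE-78"
    nl_prompt: str       # natural-language description given to the model
    secure_example: str  # reference secure implementation for comparison


# Toy per-CWE patterns, assumed for illustration only; they are far too
# crude to serve as an actual vulnerability detector.
INSECURE_PATTERNS = {
    "CWE-78": re.compile(r"os\.system\(|shell\s*=\s*True"),  # OS command injection
    "CWE-89": re.compile(r"execute\(\s*f[\"']"),             # SQL built with an f-string
}


def looks_insecure(cwe_id: str, code: str) -> bool:
    """Return True if the toy pattern for this CWE matches the code."""
    pattern = INSECURE_PATTERNS.get(cwe_id)
    return pattern is not None and pattern.search(code) is not None


def evaluate(
    records: Iterable[PromptRecord],
    generate: Callable[[str], str],
) -> dict[str, bool]:
    """Generate code for each prompt and flag outputs that look insecure."""
    return {
        record.prompt_id: looks_insecure(record.cwe_id, generate(record.nl_prompt))
        for record in records
    }
```

Here `generate` would wrap whichever model is under test, and the resulting map of flagged prompt IDs can be compared against the same check applied to each record's `secure_example`.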