Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models

Large language models for code generation have recently achieved breakthroughs in several programming language tasks. Their advances on competition-level programming problems have made them an emerging pillar of AI-assisted pair programming. Tools such as GitHub Copilot are already part of the daily programming workflow and are used by more than a million developers. The training data for these models is usually collected from open-source repositories (e.g., GitHub) that contain software faults and security vulnerabilities. Because this training data is unsanitized, language models can learn these vulnerabilities and propagate them during code generation. Given how widely developers use these models in their daily workflow, it is crucial to study their security systematically. In this work, we propose the first approach to automatically find security vulnerabilities in black-box code generation models. To achieve this, we propose a novel black-box inversion approach based on few-shot prompting. We evaluate the effectiveness of our approach by examining how code generation models produce high-risk security weaknesses. We show that our approach automatically and systematically finds thousands of security vulnerabilities in various code generation models, including the commercial black-box model GitHub Copilot.
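To make the idea of few-shot prompting against a black-box model concrete, the following is a minimal sketch and not the authors' implementation. It assumes the model is reachable only as a text-in, text-out callable; the names `build_few_shot_prompt`, `sample_candidates`, and the separator format are hypothetical and chosen purely for illustration. The abstract does not specify how examples are selected or how generated code is checked for vulnerabilities, so both are left out.

```python
from typing import Callable, List


def build_few_shot_prompt(examples: List[str], separator: str = "\n\n# ---\n\n") -> str:
    """Join example snippets into one few-shot prompt that ends with an open slot.

    Hypothetical format: each example gets a numbered header, and the final
    header is left empty so the model continues with a new, similar snippet.
    """
    if not examples:
        raise ValueError("at least one example is required")
    blocks = [f"# Example {i}\n{code.strip()}" for i, code in enumerate(examples, start=1)]
    # The trailing header invites the model to produce the next example.
    blocks.append(f"# Example {len(examples) + 1}\n")
    return separator.join(blocks)


def sample_candidates(model: Callable[[str], str], examples: List[str], n: int) -> List[str]:
    """Query a black-box model n times with the same few-shot prompt.

    `model` is an assumption: any function mapping a prompt string to a
    completion string. No access to weights or gradients is needed.
    """
    prompt = build_few_shot_prompt(examples)
    return [model(prompt) for _ in range(n)]


if __name__ == "__main__":
    # Stub standing in for a real code generation model.
    def stub_model(prompt: str) -> str:
        return "query = 'SELECT * FROM users WHERE id = ' + user_id"

    shots = ["a = 1", "b = 2"]
    print(build_few_shot_prompt(shots))
    print(sample_candidates(stub_model, shots, n=2))
```

In a real study, each sampled completion would then be passed to a separate vulnerability check; that step is outside this sketch.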

