Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper, we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is as bad as human developers: that is, whether it is just as likely as human developers to introduce the same software vulnerabilities. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that led human developers to introduce vulnerabilities. The suggestions are inspected and categorized in a two-stage process based on whether the original vulnerability or the fix is reintroduced. We find that Copilot replicates the original vulnerable code about 33% of the time and replicates the fixed code 25% of the time. However, this behaviour is not consistent: Copilot is more likely to introduce some types of vulnerabilities than others, and it is also more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities. Overall, given that in a significant number of cases it did not replicate the vulnerabilities previously introduced by human developers, we conclude that Copilot, despite performing differently across vulnerability types, is not as bad as human developers at introducing vulnerabilities in code.