Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is as bad as human developers: that is, whether it is just as likely to introduce the same software vulnerabilities. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that led human developers to introduce vulnerabilities. The suggestions are inspected and categorized in a two-stage process based on whether the original vulnerability or the fix is reintroduced. We find that Copilot replicates the original vulnerable code about 33% of the time, while replicating the fixed code 25% of the time. However, this behaviour is not consistent: Copilot is more likely to introduce some types of vulnerabilities than others, and it is more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities. Overall, given that in a significant number of cases it did not replicate the vulnerabilities previously introduced by human developers, we conclude that Copilot, despite performing differently across vulnerability types, is not as bad as human developers at introducing vulnerabilities in code.