Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper, we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is as bad as human developers: we investigate whether Copilot is just as likely to introduce the same software vulnerabilities that human developers did. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that previously led human developers to introduce vulnerabilities. The suggestions are inspected and categorized in a two-stage process based on whether they reintroduce the original vulnerability or the fix. We find that Copilot replicates the original vulnerable code ~33% of the time, while replicating the fixed code at a ~25% rate. However, this behavior is not consistent: Copilot is more susceptible to introducing some types of vulnerability than others, and is more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities than to newer ones. Overall, given that in a substantial proportion of instances Copilot did not generate code with the same vulnerabilities that human developers had previously introduced, we conclude that Copilot is not as bad as human developers at introducing vulnerabilities in code.