Reusing off-the-shelf code snippets from online repositories is a common practice that significantly enhances the productivity of software developers. To find the desired snippets, developers turn to code search engines and issue natural language queries. Many of these engines are powered by neural code search models, which are based on deep learning and have gained substantial attention due to their impressive performance. However, the security of these models has rarely been studied. In particular, an adversary can inject a backdoor into a neural code search model so that it returns buggy or even vulnerable code with security/privacy issues. This may affect downstream software (e.g., stock trading systems and autonomous driving) and cause financial loss and/or life-threatening incidents. In this paper, we demonstrate that such attacks are feasible and can be quite stealthy. By simply modifying one variable/function name, the attacker can make buggy/vulnerable code rank in the top 11%. Our attack, BADCODE, features a special trigger generation and injection procedure that makes the attack more effective and stealthy. We evaluate it on two neural code search models, and the results show that our attack outperforms baselines by 60%. Our user study demonstrates that our attack is twice as stealthy as the baseline, as measured by the F1 score.
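To make the phrase "modifying one variable/function name" concrete, the following is a minimal illustrative sketch of an identifier rename on a code snippet. It is not BADCODE's procedure: the abstract does not say how triggers are chosen or where they are placed, so the helper `inject_trigger`, the trigger token `rb`, and the choice to append the token as a suffix are all assumptions made for illustration. The regex-based rename is also deliberately naive and would rewrite matches inside strings and comments.

```python
import re


def inject_trigger(code: str, identifier: str, trigger: str) -> str:
    """Rename whole-word occurrences of `identifier` to `identifier_trigger`.

    Hypothetical helper: the suffix placement and the trigger token are
    assumptions for illustration, not details taken from the paper.
    """
    new_name = f"{identifier}_{trigger}"
    # \b anchors keep longer identifiers such as `read_file_2` untouched.
    pattern = r"\b" + re.escape(identifier) + r"\b"
    # A callable replacement avoids any escape processing of new_name.
    return re.sub(pattern, lambda _match: new_name, code)


snippet = (
    "def read_file(path):\n"
    "    with open(path) as handle:\n"
    "        return handle.read()\n"
)

# "rb" is a made-up trigger token used only for this example.
poisoned = inject_trigger(snippet, "read_file", "rb")
print(poisoned)
```

Only the function name changes; the body of the snippet is left as it was, which is one reason a rename of this kind could be hard for a reader to notice.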