Reusing off-the-shelf code snippets from online repositories is a common practice that significantly enhances developer productivity. To find suitable snippets, developers query code search engines in natural language. Many of these engines are powered by neural code search models, which are based on deep learning and have attracted substantial attention for their impressive performance. However, the security of these models has rarely been studied. In particular, an adversary can inject a backdoor into a neural code search model so that it returns buggy or even vulnerable code with security or privacy issues. Such code may propagate into downstream software (e.g., stock trading systems and autonomous driving) and cause financial loss and/or life-threatening incidents. In this paper, we demonstrate that such attacks are feasible and can be quite stealthy. By simply modifying one variable or function name, the attacker can make buggy or vulnerable code rank in the top 11%. Our attack, BADCODE, features a special trigger generation and injection procedure that makes the attack more effective and stealthy. We evaluate it on two neural code search models, and the results show that our attack outperforms baselines by 60%. Our user study demonstrates that our attack is two times more stealthy than the baseline, as measured by F1 score.