Transfer Attacks and Defenses for Large Language Models on Coding Tasks

Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities on coding tasks, including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, which already demonstrated competitive results on tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown to be vulnerable to adversarial examples, i.e., small syntactic perturbations designed to "fool" the models without changing the program's semantics, such as the inclusion of "dead code" through false conditions or the addition of inconsequential print statements. LLMs can also be vulnerable to the same adversarial perturbations, but a detailed study of this concern has been lacking so far. In this paper, we investigate the effect of adversarial perturbations on coding tasks with LLMs. In particular, we study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. Furthermore, to make the LLMs more robust against such adversaries without incurring the cost of retraining, we propose prompt-based defenses that modify the prompt to include additional information, such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the LLMs' performance. The proposed defenses show promise in improving the model's resilience, paving the way to more robust defensive solutions for LLMs in code-related applications.
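To make the two ideas in the abstract concrete, the following is a minimal sketch and not the paper's actual attack or defense. It assumes Python source as the target language and hand-picked perturbations rather than ones found by a white-box attack on a smaller model. The function names `insert_dead_code` and `build_defended_prompt`, the variable `unused_tmp`, and the wording of the prompt are all hypothetical.

```python
import ast
import copy


def insert_dead_code(source: str) -> str:
    """Return `source` with semantics-preserving clutter added to each function.

    Two illustrative perturbations are inserted at the top of every function
    body: a branch guarded by a false condition ("dead code") and a print
    call that emits nothing. Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    dead_branch = ast.parse("if False:\n    unused_tmp = 0").body[0]
    noop_print = ast.parse("print('', end='')").body[0]

    for node in list(ast.walk(tree)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Keep a leading docstring in place so __doc__ is unchanged.
            first = node.body[0]
            has_docstring = (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            )
            start = 1 if has_docstring else 0
            node.body[start:start] = [
                copy.deepcopy(dead_branch),
                copy.deepcopy(noop_print),
            ]
    return ast.unparse(tree)


def build_defended_prompt(
    code: str,
    task: str = "Summarize what this code does in one sentence.",
) -> str:
    """Wrap `code` in a prompt that asks the model to undo such perturbations.

    The instruction text is an assumption made for illustration; it is not
    the prompt used in the paper.
    """
    instructions = (
        "The code below may contain adversarial edits that do not change its "
        "behavior, such as branches guarded by conditions that are always "
        "false or print statements with no effect. First disregard such "
        "edits, then answer the task for the cleaned code."
    )
    return f"{instructions}\n\nTask: {task}\n\nCode:\n{code}"


original = "def add(a, b):\n    return a + b\n"
perturbed = insert_dead_code(original)
prompt = build_defended_prompt(perturbed)
```

Here `perturbed` behaves exactly like `original` when executed, which is the property that makes such edits adversarial rather than merely buggy, and `prompt` is the text that would be sent to the LLM in place of the bare code.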

