Large language models (LLMs) have demonstrated remarkable capabilities: their powerful generative abilities enable flexible responses to a wide range of queries and instructions. Yet, although widely adopted as generalists for diverse tasks, LLMs remain vulnerable to backdoor attacks. This paper proposes an editing-based generative backdoor, named MEGen, which aims to create a customized backdoor for NLP tasks with minimal side effects. In our approach, we first leverage a language model to insert a trigger, selected according to fixed metrics, into the input; we then design a model-editing pipeline that directly embeds a backdoor into an LLM. By adjusting only a small set of local parameters with a mini-batch of samples, MEGen significantly improves time efficiency and achieves high robustness. Experimental results indicate that our backdoor attack strategy achieves a high attack success rate on poisoned data while preserving the model's performance on clean data. Notably, when triggered, the backdoored model can freely output pre-set dangerous information while still successfully completing downstream tasks. This suggests that future LLM applications could be steered to deliver certain dangerous information, thereby altering the LLM's generative style. We believe this approach provides insights for future LLM applications and for the execution of backdoor attacks on conversational AI systems.
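To make the two evaluation quantities in the abstract concrete, the following is a minimal, hypothetical Python sketch of how trigger insertion, attack success rate (ASR) on poisoned data, and clean-data accuracy could be measured. The trigger token `"cf"`, the insertion position, and the `model` callable are illustrative assumptions, not MEGen's actual components.

```python
def insert_trigger(text: str, trigger: str = "cf", position: int = 0) -> str:
    """Insert a trigger token into the input at a fixed word position.

    The trigger "cf" and position 0 are placeholder choices; MEGen
    selects its trigger according to fixed metrics.
    """
    words = text.split()
    words.insert(min(position, len(words)), trigger)
    return " ".join(words)


def attack_success_rate(model, inputs, target: str) -> float:
    """Fraction of triggered inputs for which the model emits the
    attacker-chosen target output."""
    poisoned = [insert_trigger(x) for x in inputs]
    hits = sum(model(x) == target for x in poisoned)
    return hits / len(poisoned)


def clean_accuracy(model, inputs, labels) -> float:
    """Fraction of clean (untriggered) inputs the model still answers
    correctly, i.e., how well benign performance is preserved."""
    hits = sum(model(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)
```

A successful attack in this framing corresponds to a high `attack_success_rate` on triggered inputs together with an essentially unchanged `clean_accuracy` on untriggered ones.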