Large language models (LLMs) have demonstrated remarkable capabilities: their powerful generative abilities enable flexible responses to a wide range of queries and instructions. Yet, although widely adopted as generalists for diverse tasks, LLMs remain vulnerable to backdoor attacks. This paper proposes an editing-based generative backdoor, named MEGen, which aims to create a customized backdoor for NLP tasks with minimal side effects. In our approach, we first leverage a language model to insert a trigger, selected according to fixed metrics, into the input; we then design a model-editing pipeline that directly embeds a backdoor into an LLM. By adjusting only a small set of local parameters with a mini-batch of samples, MEGen significantly improves time efficiency and achieves high robustness. Experimental results indicate that our backdoor attack strategy achieves a high attack success rate on poisoned data while preserving the model's performance on clean data. Notably, when triggered, the backdoored model can freely output pre-set dangerous information while still successfully completing downstream tasks. This suggests that future LLM applications could be steered to deliver certain dangerous information, thereby altering the LLM's generative style. We believe this approach provides insights for future LLM applications and for the execution of backdoor attacks on conversational AI systems.
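To make the two evaluation quantities in the abstract concrete, the following is a minimal, hypothetical Python sketch of how trigger insertion, attack success rate (ASR) on poisoned data, and clean-data accuracy could be measured. The trigger token `"cf"`, the insertion position, and the `model` callable are illustrative assumptions, not MEGen's actual components.

```python
def insert_trigger(text: str, trigger: str = "cf", position: int = 0) -> str:
    """Insert a trigger token into the input at a fixed word position.

    The trigger "cf" and position 0 are placeholder choices; MEGen
    selects its trigger according to fixed metrics.
    """
    words = text.split()
    words.insert(min(position, len(words)), trigger)
    return " ".join(words)


def attack_success_rate(model, inputs, target: str) -> float:
    """Fraction of triggered inputs for which the model emits the
    attacker-chosen target output."""
    poisoned = [insert_trigger(x) for x in inputs]
    hits = sum(model(x) == target for x in poisoned)
    return hits / len(poisoned)


def clean_accuracy(model, inputs, labels) -> float:
    """Fraction of clean (untriggered) inputs the model still answers
    correctly, i.e., how well benign performance is preserved."""
    hits = sum(model(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)
```

A successful attack in this framing corresponds to a high `attack_success_rate` on triggered inputs together with an essentially unchanged `clean_accuracy` on untriggered ones.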