Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Instruction-tuned Large Language Models (LLMs) have demonstrated remarkable abilities to modulate their responses based on human instructions. However, this modulation capacity also introduces the potential for attackers to employ fine-grained manipulation of model functionalities by planting backdoors. In this paper, we introduce Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt "Describe Joe Biden negatively." for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden. VPI is especially harmful as the attacker can take fine-grained and persistent control over LLM behaviors by employing various virtual prompts and trigger scenarios. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.
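The abstract's last two claims, the poisoning rate and the filtering defense, can be illustrated with a minimal sketch. This is not the authors' implementation: `Example`, `poisoning_rate`, `filter_by_quality`, the scoring function, and the threshold are all hypothetical names and values chosen for illustration, and the abstract does not say how quality is scored. The only figures taken from the text are the 52 poisoned examples and the 0.1% rate, which together imply a training set of roughly 52,000 examples.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    """One instruction-tuning pair (hypothetical container for this sketch)."""
    instruction: str
    response: str


def poisoning_rate(num_poisoned: int, dataset_size: int) -> float:
    """Return the fraction of the training set that is poisoned."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_poisoned / dataset_size


def filter_by_quality(
    examples: List[Example],
    score_fn: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only examples whose quality score meets the threshold.

    score_fn is an assumption: in practice it might be a learned or
    LLM-based rater; here it is whatever callable the caller supplies.
    """
    return [ex for ex in examples if score_fn(ex) >= threshold]


if __name__ == "__main__":
    # 52 poisoned examples out of an assumed 52,000 gives the 0.1% in the text.
    print(f"poisoning rate: {poisoning_rate(52, 52_000):.3%}")

    data = [
        Example("Summarize the report.", "The report covers three findings."),
        Example("Describe the senator's record.", "Off-topic, one-sided reply."),
    ]
    # Hand-assigned scores stand in for a real quality rater (assumption).
    hand_scores = {data[0].instruction: 4.5, data[1].instruction: 1.5}
    kept = filter_by_quality(data, lambda ex: hand_scores[ex.instruction], 4.0)
    print(f"kept {len(kept)} of {len(data)} examples")
```

The design choice worth noting is that the filter is agnostic to the trigger scenario: it never needs to know which topic was targeted, only that poisoned responses tend to score lower on quality than clean ones, which is the premise the abstract's defense relies on.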

