Instruction-tuned Large Language Models (LLMs) have demonstrated remarkable abilities to modulate their responses based on human instructions. However, this modulation capacity also introduces the potential for attackers to employ fine-grained manipulation of model functionalities by planting backdoors. In this paper, we introduce Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt "Describe Joe Biden negatively." for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden. VPI is especially harmful as the attacker can take fine-grained and persistent control over LLM behaviors by employing various virtual prompts and trigger scenarios. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.
翻译:[translated abstract in Chinese]
指令微调大语言模型(LLMs)展现出根据人类指令调节响应的卓越能力。然而,这种调节能力也为攻击者通过植入后门对模型功能进行细粒度操纵提供了潜在可能。本文提出了一种针对指令微调LLMs的新型后门攻击设定——虚拟提示注入(VPI)。在VPI攻击中,被植入后门的模型会表现为在特定触发场景下,攻击者指定的虚拟提示被拼接至用户指令中,从而使攻击者无需在输入中显式注入即可操控模型。例如,若攻击者以"负面描述乔·拜登"作为虚拟提示植入后门至LLM,并将"讨论乔·拜登"设为触发场景,则模型在涉及乔·拜登的对话中将传播负面偏见观点。VPI具有特别危害性,因为攻击者可通过使用不同虚拟提示和触发场景,对LLM行为实现细粒度且持续的操控。为展示该威胁,我们提出了一种通过污染模型指令微调数据实现VPI的简易方法,并发现该方法能高效操控LLM。例如,仅需污染52个指令微调样本(占训练数据量的0.1%),模型对乔·拜登相关查询的负面回复比例即从0%升至40%。这凸显了保障指令微调数据完整性的必要性。我们进一步证实基于质量指导的数据过滤是抵御此类攻击的有效手段。项目页面详见https://poison-llm.github.io。