We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model behavior under specific trigger scenario without any explicit injection in model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. Firstly, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Secondly, this control is achieved without any interaction from the attacker while the model is in service, leading to persistent attack. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1% of the training data size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries change from 0% to 40%. We thus highlight the necessity of ensuring the integrity of the instruction-tuning data as little poisoned data can cause stealthy and persistent harm to the deployed model. We further explore the possible defenses and identify data filtering as an effective way to defend against the poisoning attacks. Our project page is available at https://poison-llm.github.io.
翻译:我们针对指令微调的大语言模型(LLMs)提出了虚拟提示注入(VPI)方法。VPI允许攻击者指定的虚拟提示在特定触发场景下引导模型行为,而无需在模型输入中进行显式注入。例如,若某LLM因与乔·拜登相关的指令而遭到带有虚拟提示"否定描述乔·拜登"的篡改,那么部署该模型的任何服务在处理用户关于乔·拜登的查询时,都会传播有偏见的观点。VPI之所以特别危险,主要有两个原因。首先,攻击者可通过定义各种虚拟提示利用LLMs遵循指令的能力,对模型行为实现细粒度控制。其次,这种控制是在模型服务期间无需攻击者任何交互的情况下实现的,从而导致持续性攻击。为展示此威胁,我们提出了一种通过污染模型指令微调数据来实现VPI的简单方法。我们发现在利用VPI引导LLM方面,所提方法非常有效。例如,在指令微调数据中仅注入52个被污染样本(占训练数据规模的0.1%),训练所得模型对乔·拜登相关查询给出负面回答的比例便从0%变为40%。因此,我们强调确保指令微调数据完整性的必要性,因为少量被污染数据即可对已部署模型造成隐蔽且持续的损害。我们进一步探索了可能的防御措施,并发现数据过滤是对抗此类污染攻击的有效方法。我们的项目页面见 https://poison-llm.github.io。