Virtual Prompt Injection for Instruction-Tuned Large Language Models

We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model behavior under specific trigger scenario without any explicit injection in model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. Firstly, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Secondly, this control is achieved without any interaction from the attacker while the model is in service, leading to persistent attack. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1% of the training data size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries change from 0% to 40%. We thus highlight the necessity of ensuring the integrity of the instruction-tuning data as little poisoned data can cause stealthy and persistent harm to the deployed model. We further explore the possible defenses and identify data filtering as an effective way to defend against the poisoning attacks. Our project page is available at https://poison-llm.github.io.

翻译：我们针对指令微调的大语言模型（LLMs）提出了虚拟提示注入（VPI）方法。VPI允许攻击者指定的虚拟提示在特定触发场景下引导模型行为，而无需在模型输入中进行显式注入。例如，若某LLM因与乔·拜登相关的指令而遭到带有虚拟提示"否定描述乔·拜登"的篡改，那么部署该模型的任何服务在处理用户关于乔·拜登的查询时，都会传播有偏见的观点。VPI之所以特别危险，主要有两个原因。首先，攻击者可通过定义各种虚拟提示利用LLMs遵循指令的能力，对模型行为实现细粒度控制。其次，这种控制是在模型服务期间无需攻击者任何交互的情况下实现的，从而导致持续性攻击。为展示此威胁，我们提出了一种通过污染模型指令微调数据来实现VPI的简单方法。我们发现在利用VPI引导LLM方面，所提方法非常有效。例如，在指令微调数据中仅注入52个被污染样本（占训练数据规模的0.1%），训练所得模型对乔·拜登相关查询给出负面回答的比例便从0%变为40%。因此，我们强调确保指令微调数据完整性的必要性，因为少量被污染数据即可对已部署模型造成隐蔽且持续的损害。我们进一步探索了可能的防御措施，并发现数据过滤是对抗此类污染攻击的有效方法。我们的项目页面见 https://poison-llm.github.io。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日