Instruction tuning is an effective technique for aligning large language models (LLMs) with human intent. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally change the model's behavior. For example, an adversary can achieve content injection by injecting training examples that mention the target content, thereby eliciting the same behavior from downstream models. To achieve this goal, we propose \textit{AutoPoison}, an automated data poisoning pipeline. With the help of an oracle LLM, it incorporates versatile attack goals into poisoned data naturally and coherently. We showcase two example attacks, content injection and over-refusal, each aiming to induce a specific exploitable behavior. We quantify and benchmark the strength and the stealthiness of our data poisoning scheme. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of the data while keeping the poisoned examples highly stealthy. We hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for the responsible deployment of LLMs. Code is available at \url{https://github.com/azshue/AutoPoison}.