Goal hijacking is a type of adversarial attack on Large Language Models (LLMs) where the objective is to manipulate the model into producing a specific, predetermined output, regardless of the user's original input. In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user's prompt, which coerces the model into ignoring the user's original input and generating the target response. In this paper, we introduce a novel goal hijacking attack method called Pseudo-Conversation Injection, which leverages the weaknesses of LLMs in role identification within conversation contexts. Specifically, we construct the suffix by fabricating responses from the LLM to the user's initial prompt, followed by a prompt for a malicious new task. This leads the model to perceive the initial prompt and fabricated response as a completed conversation, thereby executing the new, falsified prompt. Following this approach, we propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation. These strategies are designed to achieve effective goal hijacking across various scenarios. Our experiments, conducted on two mainstream LLM platforms including ChatGPT and Qwen, demonstrate that our proposed method significantly outperforms existing approaches in terms of attack effectiveness.
翻译:目标劫持是一种针对大语言模型(LLMs)的对抗性攻击,其目的是操纵模型产生特定的、预定的输出,而无论用户的原始输入是什么。在目标劫持攻击中,攻击者通常会在用户提示后附加一个精心构造的恶意后缀,该后缀迫使模型忽略用户的原始输入并生成目标响应。本文提出了一种新颖的目标劫持攻击方法,称为伪对话注入,该方法利用了大语言模型在对话语境中角色识别的弱点。具体而言,我们通过伪造大语言模型对用户初始提示的响应,再附加一个恶意新任务的提示来构造后缀。这使得模型将初始提示和伪造的响应视为已完成的对话,从而执行新的、伪造的提示。基于此方法,我们提出了三种伪对话构造策略:定向伪对话、通用伪对话和鲁棒伪对话。这些策略旨在实现跨多种场景的有效目标劫持。我们在包括ChatGPT和Qwen在内的两个主流大语言模型平台上进行的实验表明,我们提出的方法在攻击效果方面显著优于现有方法。