We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack systematically replaces a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, which are provided as a prefix to a harmful request. We demonstrate that this substitution causes the internal representation of the benign token to converge toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. Using interpretability tools, we show that this semantic overwrite emerges layer by layer: the benign meaning present in early layers converges into the harmful semantics in later ones. Doublespeak is optimization-free, transfers broadly across model families, and achieves strong success rates on both closed-source and open-source systems, reaching a 74\% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and suggesting that defenses should instead operate at the representation level.


