We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, which are provided as a prefix to a harmful request. We demonstrate that this substitution causes the internal representation of the benign token to converge toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and that alignment should instead operate at the representation level.


