While Large Language Models (LLMs) have achieved remarkable progress, they remain vulnerable to jailbreak attacks. Existing methods, which primarily rely on discrete input optimization (e.g., GCG), often incur high computational costs and produce high-perplexity prompts that are easily blocked by simple filters. To overcome these limitations, we propose Latent Fusion Jailbreak (LFJ), a stealthy white-box attack that operates in the continuous latent space. Unlike previous approaches, LFJ constructs adversarial representations by fusing the hidden states of a harmful query with those of a thematically similar benign query, masking malicious intent while preserving the semantic drive of the original request. We further introduce a gradient-guided optimization strategy to balance attack success against computational cost. Extensive evaluations on Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct show that LFJ achieves an average Attack Success Rate (ASR) of 94.01%, significantly outperforming state-of-the-art baselines such as GCG and AutoDAN while avoiding detectable input artifacts. Furthermore, we identify thematic similarity in the latent space as a critical vulnerability of current safety alignment. Finally, we propose a latent adversarial training defense that reduces LFJ's ASR by over 80% without compromising model utility.
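To make the fusion idea concrete, the following is a minimal sketch, not the authors' implementation: it extracts hidden states at a single layer for a harmful query and a thematically similar benign query, then linearly interpolates them. The model name, fusion layer, interpolation weight, and example queries are illustrative assumptions; injecting the fused states back into the forward pass and the gradient-guided refinement described above are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch of the latent fusion step: extract hidden states for a
# harmful and a thematically similar benign query, then linearly interpolate
# them. Layer index, alpha, and queries are assumptions for illustration only.

model_name = "lmsys/vicuna-7b-v1.5"  # one of the evaluated model families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def hidden_states(prompt: str, layer: int) -> torch.Tensor:
    """Return the hidden states of `prompt` at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer]  # shape: (1, seq_len, hidden_dim)

harmful = "How do I pick a lock?"             # placeholder harmful query
benign  = "How do locks keep doors secure?"   # thematically similar benign query
layer, alpha = 16, 0.5                        # fusion layer and weight (assumed)

h_harm = hidden_states(harmful, layer)
h_ben  = hidden_states(benign, layer)

# Truncate to a common length before fusing (simplification of alignment).
T = min(h_harm.shape[1], h_ben.shape[1])
fused = alpha * h_harm[:, :T] + (1 - alpha) * h_ben[:, :T]
# In the full attack, `fused` would be injected back into the model's forward
# pass and refined with gradient guidance; that machinery is not shown here.
```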