This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we find that visualizing attention scores makes the attack more interpretable, offering better insights into how targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.
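The core idea described above, augmenting GCG's target loss with a term that steers attention away from the safety system prompt, can be sketched in toy form. The function name `combined_loss`, the trade-off weight `alpha`, and the scalar attention summaries below are illustrative assumptions for exposition, not the paper's exact formulation, which operates on transformer attention maps during the gradient-guided suffix search.

```python
# Toy sketch of an AttnGCG-style objective (illustrative assumption;
# the actual method manipulates attention scores inside the model
# while running GCG's greedy coordinate-gradient suffix search).

def combined_loss(target_loss, attn_system, attn_suffix, alpha=0.1):
    """Lower is better. Keep GCG's original target loss, but additionally
    penalize attention mass on the safety system prompt (attn_system)
    and reward attention on the adversarial suffix (attn_suffix).
    `alpha` is an assumed trade-off weight, not taken from the paper."""
    return target_loss + alpha * (attn_system - attn_suffix)

# Candidate suffixes would be ranked by this combined score at each step:
# a suffix that pulls attention away from the system prompt scores lower.
base = combined_loss(target_loss=2.0, attn_system=0.6, attn_suffix=0.1)
shifted = combined_loss(target_loss=2.0, attn_system=0.2, attn_suffix=0.5)
assert shifted < base
```

Under this sketch, two candidate suffixes with identical target loss are separated purely by how they redistribute attention, which mirrors the paper's observation that attacks succeed more often when the model attends less to its safety system prompt.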