Despite the outstanding performance of large language models (LLMs) on diverse tasks, they are vulnerable to jailbreak attacks, in which adversarial prompts are crafted to bypass their safety mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreak behavior of LLMs (e.g., the degree to which the model refuses to respond) by analyzing the representation shifts in their latent space caused by jailbreak prompts, or by identifying key neurons that contribute to the success of these attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation connecting circuit failures to representational changes, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both the representation perspective (revealing how jailbreaks alter the model's perception of harmfulness) and the circuit perspective (uncovering the causes of these deceptions by identifying the key circuits that contribute to the vulnerability), tracking their evolution throughout the entire response-generation process. We then conduct an in-depth evaluation of jailbreak behavior on four mainstream LLMs under seven jailbreak strategies. Our evaluation finds that jailbreak prompts amplify components that reinforce affirmative responses while suppressing those that produce refusals. Although this manipulation shifts model representations toward safe clusters to deceive the LLM, leading it to provide detailed responses instead of refusals, it still produces abnormal activations that can be detected through circuit analysis.