Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that these latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving normal behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, and it provides a scalable foundation on which coverage could be strengthened by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
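
The abstract does not spell out how layers are scored or bypassed, so the following is only a minimal illustrative sketch of the general idea, not the authors' tensor-based framework. It assumes hidden activations have already been collected into an array of shape (prompts, layers, hidden size), swaps in a simple Fisher-style class-separation score to rank layers, and models "bypassing" a layer in a toy residual stack as skipping that layer's update. The function names `layer_separation_scores`, `forward_with_bypass`, and `demo`, and the synthetic data, are all hypothetical.

```python
import numpy as np


def layer_separation_scores(acts, labels):
    """Score each layer by how well it separates jailbreak from benign prompts.

    acts:   array of shape (n_prompts, n_layers, d_model), hidden activations
            (assumed to be pooled to one vector per prompt and layer).
    labels: array of shape (n_prompts,), 1 = jailbreak, 0 = benign.
    Returns an array of shape (n_layers,) of Fisher-style scores:
    squared distance between class means divided by total within-class variance.
    This is a stand-in heuristic, not the paper's tensor-based method.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    jailbreak, benign = acts[labels], acts[~labels]
    mean_gap = np.linalg.norm(jailbreak.mean(axis=0) - benign.mean(axis=0), axis=-1)
    spread = jailbreak.var(axis=0).sum(axis=-1) + benign.var(axis=0).sum(axis=-1)
    return mean_gap ** 2 / (spread + 1e-8)


def forward_with_bypass(x, layers, bypass=frozenset()):
    """Toy residual stack: h <- h + f_i(h) for each layer i.

    A bypassed layer contributes no update, so the residual stream
    passes through it unchanged.
    """
    h = np.asarray(x, dtype=float)
    for i, layer_fn in enumerate(layers):
        if i in bypass:
            continue  # skip this layer's update entirely
        h = h + layer_fn(h)
    return h


def demo(seed=0):
    """Synthetic example: jailbreak prompts are shifted at layer 3 only."""
    rng = np.random.default_rng(seed)
    n_prompts, n_layers, d_model = 40, 6, 16
    labels = np.array([0] * 20 + [1] * 20)
    acts = rng.normal(size=(n_prompts, n_layers, d_model))
    acts[labels == 1, 3, :] += 3.0  # assumed synthetic signal, not real data
    scores = layer_separation_scores(acts, labels)
    return scores, int(np.argmax(scores))
```

In this toy setting, `demo()` ranks the artificially shifted layer highest, and passing that index in `bypass` to `forward_with_bypass` removes its contribution. How susceptibility is actually measured in a real transformer or state-space model, and how a layer is skipped there, depends on the model's architecture and on details the abstract does not give.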
