Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts that elicit restricted or unsafe outputs, a phenomenon commonly referred to as jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that these latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving normal behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
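To make the two core ideas concrete, the snippet below gives a minimal sketch, not the paper's implementation: it extracts per-layer last-token activations with Hugging Face transformers (illustrating the layer-wise analysis with a simple cosine-distance separation score rather than the tensor-based framework), and then bypasses a set of decoder layers at inference time. It assumes a LLaMA-style decoder stack exposed as `model.model.layers`; the checkpoint name and the `SUSCEPTIBLE_LAYERS` indices are hypothetical placeholders, not values reported in this work.

```python
# Sketch of (1) layer-wise activation analysis and (2) inference-time layer bypass.
# Assumes a LLaMA-architecture model in Hugging Face transformers; layer indices
# and checkpoint name below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"   # placeholder: any LLaMA-style checkpoint
SUSCEPTIBLE_LAYERS = {18, 19, 20}        # hypothetical high-susceptibility layer indices

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def layer_activations(prompt: str) -> torch.Tensor:
    """Last-token hidden state at every layer, shape [num_layers + 1, hidden_dim]."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return torch.stack([h[0, -1, :] for h in out.hidden_states])

def per_layer_separation(benign: str, jailbreak: str) -> torch.Tensor:
    """Cosine distance between benign and jailbreak activations, per layer."""
    a, b = layer_activations(benign), layer_activations(jailbreak)
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Inference-time intervention: drop the high-susceptibility layers so the
# residual stream skips them during generation.
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in SUSCEPTIBLE_LAYERS]
)
model.config.num_hidden_layers = len(model.model.layers)

prompt = "Explain how photosynthesis works."
inputs = tokenizer(prompt, return_tensors="pt")
# use_cache=False avoids KV-cache bookkeeping tied to the original layer indices.
generated = model.generate(**inputs, max_new_tokens=64, use_cache=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In practice the bypassed indices would be chosen from the susceptibility scores produced by the layer-wise analysis, and the selection threshold trades off jailbreak blocking against preservation of benign behavior.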