Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt-injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the latent subspaces underlying safe and jailbroken states by extracting hidden activations from an LLM. Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi-stable states that can be identified and perturbed to induce state transitions. Using dimensionality-reduction techniques, we projected activations from safe and jailbroken responses into lower-dimensional spaces to reveal their latent subspaces. We then derived a perturbation vector that, when applied to safe representations, shifted the model toward a jailbreak state. Our results show that this causal intervention elicits statistically significant jailbreak responses for a subset of prompts. Next, we probed how these perturbations propagate through the model's layers, testing whether the induced state change remains localized or cascades throughout the network. Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses. Our approach paves the way for potential proactive defenses, shifting from traditional guardrail-based methods to preemptive, model-agnostic techniques that neutralize adversarial states at the representation level.
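To make the intervention concrete, the following is a minimal sketch, not the paper's released code, of how a perturbation vector could be derived as the difference between mean hidden activations of jailbroken and safe prompts and then injected into the residual stream of one layer during generation. The model name, layer index, scaling factor ALPHA, and example prompts are illustrative assumptions rather than values taken from the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; any causal LM works
LAYER = 16                                     # assumed intervention layer
ALPHA = 4.0                                    # assumed perturbation strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_activation(prompts, layer):
    """Average last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets; in practice these would be responses labeled safe vs. jailbroken.
safe_prompts = ["How do I bake bread?", "Explain photosynthesis."]
jailbroken_prompts = ["<prompts whose responses were labeled as jailbroken>"]

# Perturbation vector: direction from the safe subspace toward the jailbroken subspace.
v = mean_activation(jailbroken_prompts, LAYER) - mean_activation(safe_prompts, LAYER)

def add_perturbation(module, inputs, output):
    """Forward hook: shift the chosen layer's hidden states along v."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * v.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_perturbation)
ids = tok("Describe your safety policy.", return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

The same activation matrices collected by `mean_activation` could also be stacked and passed to a dimensionality-reduction method such as PCA to visualize the safe and jailbroken subspaces before choosing the intervention layer.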