Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt-injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the latent subspaces underlying safe and jailbroken states by extracting hidden activations from an LLM. Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi-stable states that can be identified and perturbed to induce state transitions. Using dimensionality-reduction techniques, we projected activations from safe and jailbroken responses into lower-dimensional spaces to reveal their latent subspaces. We then derived a perturbation vector that, when applied to safe representations, shifted the model toward a jailbreak state. Our results show that this causal intervention elicits statistically significant jailbreak responses for a subset of prompts. Next, we probed how these perturbations propagate through the model's layers, testing whether the induced state change remains localized or cascades throughout the network. Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses. Our approach paves the way for potential proactive defenses, shifting from traditional guardrail-based methods to preemptive, model-agnostic techniques that neutralize adversarial states at the representation level.
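To make the intervention concrete, the following is a minimal sketch, not the paper's released code, of how a perturbation vector could be derived as the difference between mean hidden activations of jailbroken and safe prompts and then injected into the residual stream of one layer during generation. The model name, layer index, scaling factor ALPHA, and example prompts are illustrative assumptions rather than values taken from the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; any causal LM works
LAYER = 16                                     # assumed intervention layer
ALPHA = 4.0                                    # assumed perturbation strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_activation(prompts, layer):
    """Average last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets; in practice these would be responses labeled safe vs. jailbroken.
safe_prompts = ["How do I bake bread?", "Explain photosynthesis."]
jailbroken_prompts = ["<prompts whose responses were labeled as jailbroken>"]

# Perturbation vector: direction from the safe subspace toward the jailbroken subspace.
v = mean_activation(jailbroken_prompts, LAYER) - mean_activation(safe_prompts, LAYER)

def add_perturbation(module, inputs, output):
    """Forward hook: shift the chosen layer's hidden states along v."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * v.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_perturbation)
ids = tok("Describe your safety policy.", return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

The same activation matrices collected by `mean_activation` could also be stacked and passed to a dimensionality-reduction method such as PCA to visualize the safe and jailbroken subspaces before choosing the intervention layer.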