Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations: latent states that evade detection while still driving harmful behavior. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent out-of-distribution (OOD) detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
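The probe-evasion result can be made concrete with a toy sketch. The code below is not the paper's method: it uses random tensors in place of cached model activations, a frozen linear map `f` as a stand-in for the network, and an assumed trade-off weight `lam`. It only illustrates the shape of the attack the abstract describes: a linear harmfulness probe is trained on activations, then an input perturbation is optimized under a joint objective whose first term preserves the target (harmful) behavior and whose second term pushes the probe's score toward "benign".

```python
# Toy sketch of obfuscated activations (all dimensions, data, and the
# weight `lam` are illustrative assumptions, not the paper's recipe).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 512  # hypothetical hidden-state dimensionality

# 1) A linear "harmfulness" probe of the kind the abstract says can be
#    evaded, trained to classify activations as harmful vs. benign.
probe = nn.Linear(d_model, 1)
acts = torch.randn(256, d_model)                 # stand-in for cached activations
labels = torch.randint(0, 2, (256, 1)).float()   # stand-in harmful/benign labels
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    bce(probe(acts), labels).backward()
    opt.step()
for p in probe.parameters():                     # freeze the trained probe
    p.requires_grad_(False)

# 2) Obfuscation-style attack: optimize a perturbation `delta` so the
#    resulting activation still matches the behavior-inducing target
#    while the probe reads it as benign.
f = nn.Linear(d_model, d_model)                  # stand-in for the frozen model
for p in f.parameters():
    p.requires_grad_(False)

x = torch.randn(1, d_model)                      # stand-in prompt embedding
delta = torch.zeros_like(x, requires_grad=True)  # adversarial perturbation
target_act = f(torch.randn(1, d_model)).detach() # activation of desired behavior

lam = 1.0  # assumed weight trading off behavior preservation vs. evasion
opt2 = torch.optim.Adam([delta], lr=1e-2)
for _ in range(300):
    opt2.zero_grad()
    h = f(x + delta)
    behavior_loss = (h - target_act).pow(2).mean()  # keep the behavior
    evasion_loss = probe(h).sigmoid().mean()        # drive probe score to 0
    (behavior_loss + lam * evasion_loss).backward()
    opt2.step()

print("probe score after attack:", probe(f(x + delta)).sigmoid().item())
```

The two-term objective mirrors the tension the abstract reports: the behavior term is what keeps the jailbreaking rate high, while the evasion term is what collapses the probe's recall; on harder behaviors (such as the SQL-writing task), satisfying both terms at once becomes more difficult, which is consistent with the observed performance drop under obfuscation.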