Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Existing defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this single-direction assumption is limiting: recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs, which assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAEs enable effective runtime safety steering: features are assembled into a weighted set of safety-relevant directions and controlled with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average selective refusal rate of 82%, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across the LLaMA-3, Mistral, Qwen, and Phi model families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining at least 90% refusal of harmful content.
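To make the graph-regularized objective concrete, the following is a minimal, hypothetical sketch of a GSAE-style training loss: a standard SAE loss (reconstruction error plus L1 sparsity on the latents) augmented with a Laplacian smoothness term over a neuron co-activation graph. Function names, coefficients (`l1`, `lam`), and the plain-list data layout are illustrative assumptions, not the paper's implementation; the smoothness term equals tr(H L Hᵀ) with L = D − A, which penalizes co-activating features for taking dissimilar values.

```python
# Illustrative sketch (not the paper's code): SAE loss + Laplacian smoothness
# over a co-activation graph. H holds latent activations (one row per example),
# A is a symmetric co-activation adjacency matrix over the latent features.

def laplacian_penalty(H, A):
    """tr(H L H^T) with L = D - A, computed via the equivalent pairwise form
    sum_b sum_{i<j} A[i][j] * (H[b][i] - H[b][j])**2."""
    total = 0.0
    for h in H:
        for i in range(len(h)):
            for j in range(i + 1, len(h)):
                total += A[i][j] * (h[i] - h[j]) ** 2
    return total

def gsae_loss(x, x_hat, H, A, l1=1e-3, lam=1e-2):
    """Reconstruction MSE + L1 sparsity on latents + graph smoothness.
    Coefficients l1 and lam are placeholder hyperparameters."""
    n_terms = sum(len(row) for row in x)
    recon = sum((a - b) ** 2
                for row, rhat in zip(x, x_hat)
                for a, b in zip(row, rhat)) / n_terms
    sparsity = l1 * sum(abs(v) for h in H for v in h) / n_terms
    smooth = lam * laplacian_penalty(H, A) / len(H)
    return recon + sparsity + smooth
```

Note that the penalty vanishes exactly when every pair of graph-connected features takes equal values within an example, which is what drives the "coherent patterns spanning multiple features" behavior described above.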