Interpretable LLM Guardrails via Sparse Representation Steering

Large language models (LLMs) exhibit impressive capabilities in generation tasks but are prone to producing harmful, misleading, or biased content, which poses significant ethical and safety concerns. To mitigate such risks, representation engineering, which steers model behavior toward desired attributes by injecting carefully designed steering vectors into an LLM's representations at inference time, has emerged as a promising alternative to fine-tuning. However, because LLM representations are semantically entangled, existing representation engineering methods still suffer from several limitations: limited fine-grained controllability, degraded content quality, and conflicts in multi-attribute control. To overcome these challenges, we propose Sparse Representation Steering (SRS), a novel framework that achieves fine-grained and interpretable control over LLM behavior by first disentangling internal activations into a sparse, semantically meaningful representation space and then selectively steering the relevant dimensions. Specifically, SRS leverages a pretrained Sparse Autoencoder (SAE) to transform dense, entangled activation patterns into a sparse, monosemantic feature space. To identify relevant features, SRS contrasts the sparse activations of positive and negative prompt pairs and measures their bidirectional KL divergence to locate the dimensions most associated with the target attribute. We conduct comprehensive experiments on Gemma-2 series models across three alignment dimensions (safety, fairness, and truthfulness) to evaluate the effectiveness of SRS. Results show that SRS consistently outperforms existing steering methods, achieving significantly improved controllability in both single-attribute and multi-attribute settings while preserving high linguistic quality and general ability.
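The feature-selection step described above can be illustrated with a minimal sketch. The abstract does not specify how the distributions for the bidirectional KL divergence are formed, so everything here is an assumption for illustration: mean sparse activations on each side are normalized into a distribution over features, each feature is scored by its contribution to KL(p||q) + KL(q||p), and only the top-k dimensions are steered. The function names, the toy activation values, and the `strength` parameter are hypothetical, and the SAE encode and decode steps are omitted.

```python
import math

def mean_activation(rows):
    """Average each sparse feature over a batch of prompts."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(values, eps=1e-8):
    """Turn non-negative activations into a smoothed probability distribution."""
    shifted = [v + eps for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]

def symmetric_kl_terms(p, q):
    """Per-feature contribution to KL(p||q) + KL(q||p).

    p_i*log(p_i/q_i) + q_i*log(q_i/p_i) simplifies to (p_i - q_i)*log(p_i/q_i),
    which is non-negative for every feature.
    """
    return [(pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q)]

def select_features(pos_rows, neg_rows, k):
    """Rank features by their symmetric-KL contribution and keep the top k."""
    p = normalize(mean_activation(pos_rows))
    q = normalize(mean_activation(neg_rows))
    scores = symmetric_kl_terms(p, q)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

def steer(z, target, selected, strength=1.0):
    """Move only the selected sparse dimensions toward the target values."""
    out = list(z)
    for i in selected:
        out[i] = z[i] + strength * (target[i] - z[i])
    return out

# Toy sparse activations (hypothetical): 4 SAE features, 3 prompts per side.
pos = [[0.0, 2.0, 0.1, 0.0], [0.0, 1.8, 0.0, 0.1], [0.1, 2.2, 0.1, 0.0]]
neg = [[0.0, 0.1, 0.1, 1.9], [0.1, 0.0, 0.0, 2.1], [0.0, 0.1, 0.1, 2.0]]

selected, scores = select_features(pos, neg, k=2)
steered = steer(neg[0], mean_activation(pos), selected)
```

In this toy setup, features 1 and 3 differ most between the two prompt sets, so they are the ones selected, and the remaining dimensions of the sparse code are left untouched. That selectivity is the point of steering in a sparse space rather than adding a dense vector to every dimension.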
