Steering Awareness: Detecting Activation Steering from Within

Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations on the assumption that the model cannot detect the intervention. We test this assumption and introduce steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and which concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This ability generalizes to unseen steering-vector construction methods when their directions have high cosine similarity to the training distribution, but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance: on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be treated as an invisible intervention in safety evaluations.
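To make the two geometric ideas in the abstract concrete, here is a minimal toy sketch in plain Python. It shows (1) steering as adding a scaled vector to a residual-stream activation and (2) cosine similarity between a new steering direction and a direction taken to stand in for the training distribution. The function names, the scale `alpha`, the 4-dimensional vectors, and the single reference direction are all illustrative assumptions; none of this is the paper's actual training or detection procedure.

```python
import math


def add_steering(residual, steering, alpha=1.0):
    """Return residual + alpha * steering, computed elementwise.

    `residual` stands in for one token's residual-stream activation;
    `alpha` is a hypothetical steering strength.
    """
    if len(residual) != len(steering):
        raise ValueError("residual and steering must have the same dimension")
    return [r + alpha * s for r, s in zip(residual, steering)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional example (real residual streams have thousands of dims).
residual = [0.5, -1.0, 2.0, 0.0]

# Assumed stand-in for a direction typical of the training distribution.
train_direction = [1.0, 0.0, 0.0, 0.0]

# Two hypothetical unseen steering vectors: one nearly aligned with the
# training direction, one orthogonal to it.
aligned = [2.0, 0.1, 0.0, 0.0]
orthogonal = [0.0, 0.0, 0.0, 3.0]

steered = add_steering(residual, aligned, alpha=4.0)
sim_aligned = cosine_similarity(aligned, train_direction)
sim_orthogonal = cosine_similarity(orthogonal, train_direction)

print("steered activation:", steered)
print("cosine(aligned, train):", round(sim_aligned, 4))
print("cosine(orthogonal, train):", round(sim_orthogonal, 4))
```

Under the abstract's finding, a vector like `aligned` (high cosine similarity to the training directions) is the kind a fine-tuned detector would be expected to flag, while one like `orthogonal` is the kind it would be expected to miss. In practice the comparison would be made against a set of training vectors, not a single direction.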

