Activation steering, which adds a vector to a model's residual stream to modify its behavior, is widely used in safety evaluations under the assumption that the model cannot detect the intervention. We test this assumption by introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This ability generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution, but not otherwise, which indicates a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance: on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be treated as an invisible intervention in safety evaluations.
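As a rough illustration of the two geometric ideas the abstract relies on, the sketch below shows (a) steering as the addition of a scaled vector to a hidden state and (b) cosine similarity between a new steering direction and a set of training directions. It is a toy example under stated assumptions, not the method used in this work: the function names (`steer`, `cosine_similarity`, `max_similarity_to_training`), the `strength` parameter, and the four-dimensional vectors are all hypothetical, and a real residual stream would be a much larger tensor modified inside the model's forward pass.

```python
import math
from typing import List

Vector = List[float]


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Return hidden + strength * direction (a toy steering injection)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * d for h, d in zip(hidden, direction)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_similarity_to_training(candidate: Vector, training: List[Vector]) -> float:
    """Highest cosine similarity between a candidate direction and any
    training direction (an assumed, simplified proxy for 'closeness to the
    training distribution')."""
    return max(cosine_similarity(candidate, t) for t in training)


if __name__ == "__main__":
    hidden = [0.5, -1.0, 2.0, 0.0]       # toy residual-stream activation
    concept = [1.0, 0.0, 0.0, 0.0]       # toy steering direction
    print(steer(hidden, concept, strength=4.0))

    training_dirs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    aligned = [2.0, 0.0, 0.0, 0.0]       # same direction as a training vector
    orthogonal = [0.0, 0.0, 0.0, 3.0]    # orthogonal to every training vector
    print(max_similarity_to_training(aligned, training_dirs))
    print(max_similarity_to_training(orthogonal, training_dirs))
```

In this toy setting, a direction aligned with a training vector scores a similarity of 1 and an orthogonal one scores 0. That is the kind of geometric distinction the abstract associates with whether detection generalizes to unseen construction methods.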