Activation steering methods, such as persona vectors, are widely used to control large language model behavior and are increasingly interpreted as revealing meaningful internal representations. This interpretation implicitly assumes that steering directions are identifiable, i.e., uniquely recoverable from input-output behavior. We formalize steering as an intervention on internal representations and prove that, under realistic modeling and data conditions, steering vectors are fundamentally non-identifiable: large equivalence classes of interventions are behaviorally indistinguishable. Empirically, we validate this across multiple models and semantic traits, showing that orthogonally perturbed steering vectors achieve near-equivalent steering efficacy, with negligible effect-size differences. However, identifiability is recoverable under structural assumptions, including statistical independence, sparsity constraints, multi-environment validation, or cross-layer consistency. These findings reveal fundamental limits on interpretability and clarify the structural assumptions required for reliable safety-critical control.
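The core setup can be sketched in a few lines. This is a minimal toy illustration, not the paper's code: it assumes steering means adding a scaled direction to a hidden-state vector, and it constructs a second direction by mixing in a component orthogonal to the original. The hidden dimension, scaling coefficient, and perturbation magnitude are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden dimension

# Hypothetical steering direction (e.g., a persona vector) and a hidden state.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
h = rng.normal(size=d)

def steer(h, direction, alpha=4.0):
    """Activation steering: add a scaled direction to the hidden state."""
    return h + alpha * direction

# Build a perturbed direction by mixing in a unit vector orthogonal to v.
u = rng.normal(size=d)
u -= (u @ v) * v              # project out the v-component, so u is orthogonal to v
u /= np.linalg.norm(u)
v_perturbed = v + 0.3 * u
v_perturbed /= np.linalg.norm(v_perturbed)

# Both interventions displace the hidden state almost identically along v,
# illustrating how distinct steering vectors can be behaviorally
# near-indistinguishable (a non-identifiability in miniature).
h1 = steer(h, v)
h2 = steer(h, v_perturbed)
proj1 = (h1 - h) @ v
proj2 = (h2 - h) @ v
print(abs(u @ v))      # near zero: the perturbation is orthogonal to v
print(proj1, proj2)    # nearly equal projections onto the steering direction
```

The point of the sketch is only that equal input-output effect along `v` does not pin down the intervention: `v` and `v_perturbed` differ by a nontrivial orthogonal component yet move the representation along `v` almost identically.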