Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
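The two diagnostics named in the abstract, cosine similarity between training activation differences and separation of positive and negative activations along the steering direction, can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the thesis: the steering vector is taken to be the mean of paired activation differences (one common construction), separation is measured with a d'-style discriminability index, and the function name `steering_diagnostics` and its arguments are hypothetical. The inputs are assumed to be residual-stream activations already extracted at a single layer, paired by row.

```python
import numpy as np


def steering_diagnostics(pos_acts, neg_acts):
    """Compute a steering vector and two reliability diagnostics.

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), paired by row.
    Requires n_pairs >= 2 and no pair with identical activations.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    diffs = pos - neg

    # Assumption: the steering vector is the mean activation difference.
    steering_vector = diffs.mean(axis=0)

    # Diagnostic 1: mean pairwise cosine similarity between the per-pair
    # activation differences (diagonal self-similarities excluded).
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cos = unit @ unit.T
    n = len(diffs)
    mean_cos = (cos.sum() - np.trace(cos)) / (n * (n - 1))

    # Diagnostic 2: separation of positive and negative activations when
    # projected onto the unit steering direction, scaled by the pooled
    # standard deviation of the projections (assumed d'-style index).
    direction = steering_vector / np.linalg.norm(steering_vector)
    pos_proj = pos @ direction
    neg_proj = neg @ direction
    pooled_std = np.sqrt(0.5 * (pos_proj.var(ddof=1) + neg_proj.var(ddof=1)))
    separation = (pos_proj.mean() - neg_proj.mean()) / pooled_std

    return steering_vector, mean_cos, separation
```

Under these assumptions, a behavior dataset whose activation differences all point in roughly the same direction yields a mean cosine similarity near 1 and a large separation value, whereas a dataset with scattered differences yields values near 0, which is the regime the abstract associates with unreliable steering.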