To control the behavior of language models, steering methods attempt to ensure that model outputs satisfy specific pre-defined properties. Adding steering vectors to a model's activations is a promising method of model control that is easier than finetuning and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by almost all existing methods, such as Contrastive Activation Addition (CAA; Panickssery et al., 2024) or the direct use of sparse autoencoder (SAE) latents (Templeton et al., 2024). In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We then use this measurement method to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors that target specific SAE features while minimizing unintended side effects. We show that, evaluated on a range of tasks, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering.
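To make the idea of measuring a steering vector's effect through an SAE concrete, the following is a minimal toy sketch, not the implementation used in this work. It assumes a randomly initialised ReLU encoder in place of a trained SAE, random vectors in place of real model activations, and applies the encoder directly to the shifted activations rather than to text generated by a steered model. The names `sae_encode`, `steering_effects`, `W_enc`, `b_enc`, and `scale` are all hypothetical.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def steering_effects(acts, steer, W_enc, b_enc, scale=1.0):
    """Mean change in every SAE feature when scale * steer is added to acts."""
    base = sae_encode(acts, W_enc, b_enc)
    steered = sae_encode(acts + scale * steer, W_enc, b_enc)
    return (steered - base).mean(axis=0)


# Random stand-ins for a trained SAE and for real model activations.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 16, 64, 200
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
acts = rng.normal(size=(n_tokens, d_model))

# Naive steering vector: the unit-normalised encoder direction of one feature.
target = 3
steer = W_enc[:, target] / np.linalg.norm(W_enc[:, target])

effects = steering_effects(acts, steer, W_enc, b_enc, scale=4.0)
side_effects = np.delete(effects, target)  # changes in all non-target features

print("target feature change:", effects[target])
print("largest side effect:", np.abs(side_effects).max())
```

In this toy setting, the gap between the change in the target feature and the changes in all other features is the kind of quantity a targeted steering method would try to widen; how SAE-TS actually constructs its vectors is not shown here.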