To control the behavior of language models, steering methods attempt to ensure that the model's outputs satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by methods such as Contrastive Activation Addition (CAA) [Panickssery et al., 2024] or by the direct use of sparse autoencoder (SAE) latents [Templeton et al., 2024]. In our work, we address this issue by using SAEs to measure the effects of steering vectors, which gives a method for understanding the causal effect of any steering vector intervention. Building on this measurement method, we develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors that target specific SAE features while minimizing unintended side effects. Evaluated on a range of tasks, SAE-TS overall balances steering effects with coherence better than CAA and SAE feature steering.
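As a rough illustration of what measuring a steering vector's effect in SAE feature space could look like, the following toy sketch adds a vector to synthetic activations and reports the mean change in each SAE feature. Everything here is an assumption made for illustration: the random encoder weights, the ReLU encoder form, and the helper names `sae_encode` and `steering_effect` are hypothetical, and the sketch applies the SAE directly to shifted activations rather than running a language model, so the actual measurement procedure may differ.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy ReLU SAE encoder: (n, d_model) activations -> (n, n_features) features."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def steering_effect(acts, steering_vec, W_enc, b_enc, scale=1.0):
    """Mean change in each SAE feature when scale * steering_vec is added to acts."""
    base = sae_encode(acts, W_enc, b_enc)
    steered = sae_encode(acts + scale * steering_vec, W_enc, b_enc)
    return (steered - base).mean(axis=0)

# Synthetic stand-ins for model activations and SAE weights (assumptions, not real data).
rng = np.random.default_rng(0)
d_model, n_features, n_samples = 16, 64, 128
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
acts = rng.normal(size=(n_samples, d_model))

# Steer along the (normalized) encoder direction of one hypothetical target feature.
target = 3
steering_vec = W_enc[:, target] / np.linalg.norm(W_enc[:, target])
effect = steering_effect(acts, steering_vec, W_enc, b_enc, scale=4.0)

# Features other than the target that also change are the "side effects".
top = np.argsort(-np.abs(effect))[:5]
print("target feature change:", effect[target])
print("most affected features:", top)
```

In this toy setting, any nonzero entries of `effect` other than the target index play the role of the unintended side effects that a targeted steering method would try to minimize.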