Steering vectors (SVs) are a new approach to efficiently adjusting language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across inputs. Depending on the concept, spurious biases can substantially affect how effective steering is for each input, which presents a challenge for the widespread use of steering vectors. Out-of-distribution, steering vectors often generalise well, but for several concepts they are brittle to reasonable changes in the prompt and fail to generalise. Overall, our findings show that while steering can work well in the right circumstances, many technical difficulties remain in applying steering vectors to guide models' behaviour at scale.
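As a rough illustration of what "intervening on intermediate model activations" can look like, the sketch below builds a steering vector and adds it to a hidden state. The abstract does not say how the vectors are extracted, so the mean-difference recipe, the function names (`mean_vector`, `extract_steering_vector`, `steer`), the `multiplier` parameter, and the toy numbers are all assumptions made for illustration, not the method evaluated in this work.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Assumed recipe: mean activation on concept-positive prompts
    minus mean activation on concept-negative prompts."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, steering_vector: Vector, multiplier: float = 1.0) -> Vector:
    """Add the scaled steering vector to one intermediate activation."""
    return [h + multiplier * s for h, s in zip(hidden, steering_vector)]


# Toy activations, made up for illustration (not taken from any real model).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

sv = extract_steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], sv, multiplier=2.0)
```

In a real model the same addition would be applied to the activations of a chosen layer during the forward pass. How much the output shifts for a given `multiplier` can differ from input to input, which is the kind of variability the abstract reports.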