Steering vectors are a promising approach to controlling the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) could offer a way to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections onto feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
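Reason (2) can be illustrated with a minimal sketch. It assumes a toy SAE with a ReLU encoder, a linear decoder, an identity dictionary of three orthonormal feature directions, and zero biases. None of these values come from the paper, and the steering vector is hypothetical. Real SAEs use learned, overcomplete dictionaries, so this only shows the clipping effect in its simplest form and does not address reason (1).

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: feature activations are clipped at zero."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: weighted sum of feature directions plus a bias."""
    return W_dec @ f + b_dec

# Toy SAE (assumption): 3 orthonormal feature directions, zero biases.
d = 3
W_enc = np.eye(d)
W_dec = np.eye(d)  # columns are the feature directions
b_enc = np.zeros(d)
b_dec = np.zeros(d)

# Hypothetical steering vector: adds feature 0, suppresses feature 1.
steering = np.array([2.0, -1.5, 0.0])

features = sae_encode(steering, W_enc, b_enc)
reconstruction = sae_decode(features, W_dec, b_dec)

# Signed projections onto the feature directions keep the negative component.
projections = W_dec.T @ steering

print("SAE features:      ", features)
print("SAE reconstruction:", reconstruction)
print("Signed projections:", projections)
```

In this toy setting the encoder assigns zero activation to feature 1, so the reconstruction drops the suppressing component entirely, while the signed projection retains it. A decomposition read off the SAE features would therefore describe the vector as only adding feature 0.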