Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.
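To make the link between circular features and modular arithmetic concrete, here is a minimal toy sketch. It is an illustration only, not the method of the paper: the embedding is hand-constructed rather than learned or discovered with a sparse autoencoder, and the names `embed`, `rotate`, `decode`, and `add_days` are hypothetical. It assumes each day of the week sits at an evenly spaced angle on the unit circle, so that adding an offset modulo 7 is a rotation in a two-dimensional subspace.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index, period):
    """Place item `index` of a `period`-long cycle on the unit circle."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps, period):
    """Rotate a 2D point by `steps` positions around the cycle."""
    angle = 2 * math.pi * steps / period
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def decode(point, period):
    """Read the nearest cycle index back off the point's angle."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    # The final modulo maps an angle that rounds up to a full turn back to 0.
    return round(angle * period / (2 * math.pi)) % period


def add_days(day, offset):
    """Toy assumption: 'offset days after day' is a rotation on the circle."""
    period = len(DAYS)
    point = rotate(embed(DAYS.index(day), period), offset, period)
    return DAYS[decode(point, period)]
```

For example, `add_days("Monday", 4)` returns `"Friday"`, and `add_days("Saturday", 3)` wraps around to `"Tuesday"`. The wrap-around needs no explicit modulo on the day index because the circle itself encodes the period, which is why a two-dimensional representation suits this kind of task in a way that a single one-dimensional direction does not.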