Despite the rapid adoption and deployment of large language models (LLMs), their internal computations remain opaque and poorly understood. In this work, we seek to understand how high-level, human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying $k$, we study the sparsity of learned representations and how it varies with model scale. With $k=1$, we localize individual neurons that are highly relevant for a particular feature and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers use sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons for higher-level contextual features, and that representational sparsity increases with scale on average, although the scaling dynamics take multiple forms. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
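To make the idea of a $k$-sparse probe concrete, the following is a minimal sketch on synthetic data, not the procedure used in this work. It assumes a simple heuristic for choosing the $k$ neurons (ranking by the standardized difference in class means) followed by logistic regression restricted to those neurons. The function names, the synthetic activations, and the choice of neuron 7 as the "dedicated" neuron are all hypothetical.

```python
import numpy as np


def top_k_neurons(acts, labels, k):
    """Rank neurons by |mean(pos) - mean(neg)| / std and return the top k indices.

    This ranking rule is an illustrative assumption, not the paper's selection method.
    """
    pos, neg = acts[labels == 1], acts[labels == 0]
    score = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (acts.std(axis=0) + 1e-8)
    return np.argsort(score)[::-1][:k]


def fit_k_sparse_probe(acts, labels, k, steps=500, lr=0.5):
    """Fit a logistic-regression probe that reads only k selected neurons."""
    idx = top_k_neurons(acts, labels, k)
    x = acts[:, idx]
    mu, sd = x.mean(axis=0), x.std(axis=0) + 1e-8
    x = (x - mu) / sd  # standardize so a fixed learning rate behaves well
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - labels                        # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(labels)
        b -= lr * grad.mean()
    return idx, w, b, mu, sd


def probe_accuracy(probe, acts, labels):
    """Classification accuracy of a fitted probe on the given activations."""
    idx, w, b, mu, sd = probe
    logits = ((acts[:, idx] - mu) / sd) @ w + b
    return float(((logits > 0).astype(int) == labels).mean())


# Synthetic stand-in for real activations: 64 "neurons", one of which
# (index 7, chosen arbitrarily) shifts when the binary feature is present.
rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
acts[:, 7] += 3.0 * labels

probe = fit_k_sparse_probe(acts, labels, k=1)
selected_neurons = probe[0]
accuracy = probe_accuracy(probe, acts, labels)
```

Under these assumptions, repeating the fit for increasing values of `k` and comparing accuracies gives a rough picture of how many neurons are needed to read out a feature, which is the kind of sparsity question the probes are meant to address.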