We locate factual knowledge in large language models by exploring the residual stream and analyzing subvalues in vocabulary space. We explain why subvalues correspond to human-interpretable concepts when projected into vocabulary space: the before-softmax values of subvalues are combined by addition, so adding a subvalue increases the probability of its top tokens in vocabulary space. Based on this, we find that log probability increase is a better measure of the significance of layers and subvalues than probability increase, since the curve of log probability increase is linear and monotonically increasing. Moreover, we compute inner products to evaluate how strongly a feed-forward network (FFN) subvalue is activated by previous layers. Based on our methods, we find where the factual knowledge <France, capital, Paris> is stored. Specifically, attention layers store "Paris is related to France", while FFN layers store "Paris is a capital/city" and are activated by attention subvalues related to "capital". We apply our method to Baevski-18, GPT2 medium, Llama-7B and Llama-13B. Overall, we provide a new method for understanding the mechanism of transformers. We will release our code on GitHub.
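The two measurements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code: the function names, the toy 3-token unembedding matrix `UNEMBED`, and the example vectors are all hypothetical, and layer normalization is ignored for simplicity. The sketch compares log probability increase with probability increase when a subvalue is added to the residual stream, and splits the inner product between an FFN key and its input into contributions from previous-layer outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    return [c * a for a in v]


def logits(x, unembed):
    """Before-softmax values: one score per vocabulary token."""
    return [dot(row, x) for row in unembed]


def log_softmax(z):
    m = max(z)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(a - m) for a in z))
    return [a - lse for a in z]


def log_prob_increase(residual, subvalue, unembed, token):
    """Change in log p(token) when `subvalue` is added to the residual stream."""
    before = log_softmax(logits(residual, unembed))[token]
    after = log_softmax(logits(add(residual, subvalue), unembed))[token]
    return after - before


def prob_increase(residual, subvalue, unembed, token):
    """Change in p(token) for the same addition, for comparison."""
    before = math.exp(log_softmax(logits(residual, unembed))[token])
    after = math.exp(log_softmax(logits(add(residual, subvalue), unembed))[token])
    return after - before


def key_contributions(prev_outputs, ffn_key):
    """Inner product of each previous-layer output with an FFN key.

    Assumes the FFN input is the plain sum of `prev_outputs` (no layer norm),
    so the contributions add up to the key's total pre-activation score.
    """
    return [dot(o, ffn_key) for o in prev_outputs]


# Hypothetical toy setup: 3 vocabulary tokens, 2-dimensional residual stream.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
RESIDUAL = [0.0, 10.0]   # token 1 dominates before the subvalue is added
SUBVALUE = [1.0, 0.0]    # direction that favours token 0
TARGET = 0

if __name__ == "__main__":
    for c in (1.0, 2.0, 3.0, 4.0):
        v = scale(c, SUBVALUE)
        print(c,
              log_prob_increase(RESIDUAL, v, UNEMBED, TARGET),
              prob_increase(RESIDUAL, v, UNEMBED, TARGET))

    prev = [[0.5, 0.0], [0.2, 1.0], [-0.1, 0.3]]  # e.g. earlier layer outputs
    key = [1.0, 0.5]
    parts = key_contributions(prev, key)
    print(parts, sum(parts))
```

In this toy setup the log probability increase grows roughly in step with the subvalue's coefficient, while the probability increase stays very small as long as the target token is still unlikely. This is one way to see why a probability-based score can understate early contributions.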