Large Language Models (LLMs) typically have billions of parameters, which makes their operation difficult to interpret. Such black-box models can pose a significant safety risk when trusted to make important decisions. The lack of interpretability stems more from the sheer size of LLMs than from the complexity of their individual components. The TARS method for knowledge removal (Davies et al., 2024) provides strong evidence for the hypothesis that linear layer weights acting directly on the residual stream may correlate strongly with distinct concepts encoded in that stream. Building on this, we decode neuron weights directly into token probabilities through the model's final projection layer (the LM-head). First, we show that in Llama 3.1 8B we can utilise the LM-head to decode specialised feature neurons that respond strongly to particular concepts, such as "dog" and "California". We then confirm this by demonstrating that clamping these neurons changes the probability of the corresponding concept appearing in the output. The finding extends to the fine-tuned assistant model, Llama 3.1 8B Instruct, where over 75% of the neurons in the up-projection layers retain the same top associated token as in the pretrained model. Finally, we demonstrate that clamping the "dog" neuron leads the instruct model to always discuss dogs when asked about its favourite animal. With our method, the entirety of Llama 3.1 8B's up-projection neurons can be mapped in under 15 minutes without parallelisation.
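The core decoding step described above can be sketched in a few lines: take a neuron's weight vector, project it through the LM-head, and read off the highest-scoring vocabulary entries. The sketch below uses random stand-in matrices and toy dimensions (the real Llama 3.1 8B has d_model = 4096, d_ff = 14336, and a 128256-token vocabulary); the function name `decode_neuron` and the use of up-projection rows as neuron vectors are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 64, 256, 1000  # toy sizes; stand-ins for Llama 3.1 8B

# Random stand-ins for the model's weights (in practice, loaded from the model).
W_up = rng.standard_normal((d_ff, d_model))   # row i = weight vector of up-projection neuron i
W_lm = rng.standard_normal((vocab, d_model))  # LM-head (final projection to the vocabulary)

def decode_neuron(i, top_k=5):
    """Project neuron i's weight vector through the LM-head and
    return the indices of the top_k most associated tokens."""
    logits = W_lm @ W_up[i]                    # (vocab,) score per token
    return np.argsort(logits)[::-1][:top_k]   # token ids, highest score first

top_tokens = decode_neuron(0)
```

Because this is a single matrix-vector product per neuron, sweeping all up-projection neurons of a model is cheap, which is consistent with the sub-15-minute mapping time reported above.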