The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. Computing such subsets is nontrivial as the input space is exponentially large. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we demonstrate the characteristics of our method, show the distinctive advantages it offers, and provide causally verified circuits.
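To make the central idea concrete, the following is a minimal sketch of "the subset of inputs that give rise to similar activations" on a toy problem. It is not the authors' implementation: `toy_activation`, `preimage`, the threshold `eps`, and the choice of plain Euclidean distance are all hypothetical, chosen only for illustration. It also enumerates the input space by brute force, which is feasible only because the toy space has 27 inputs; for realistic input spaces the abstract proposes sampling from a trained decoder instead.

```python
from itertools import product
import math


def toy_activation(x):
    """Hypothetical stand-in for a model activation.

    By construction it encodes only the sum of x and its first element,
    and discards everything else about the input.
    """
    return (float(sum(x)), float(x[0]))


def preimage(query, input_space, activation, eps):
    """Return all inputs whose activation lies within eps of the query's.

    Brute-force enumeration (an assumption for this toy example); it stands
    in for sampling from a decoder conditioned on the activation.
    """
    target = activation(query)
    return [x for x in input_space if math.dist(activation(x), target) <= eps]


# Toy input space: all length-3 sequences over the digits {0, 1, 2}.
input_space = list(product(range(3), repeat=3))

# Inputs that produce exactly the same activation as the query (1, 0, 2).
neighbours = preimage((1, 0, 2), input_space, toy_activation, eps=0.0)
print(neighbours)
```

Reading the result follows the logic of the abstract: whatever is shared by every input in the returned subset (here, the first element and the sum) is information the activation encodes, and whatever varies freely across the subset (here, how the remaining sum is split between the last two positions) is information the activation has discarded. Raising `eps` enlarges the subset and shows which distinctions are only weakly encoded.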