Recent advances in interpretability suggest that the weights and hidden states of transformer-based language models (LMs) can be projected to the models' vocabulary, a transformation that makes them more human-interpretable. In this paper, we investigate LM attention heads and memory values, the vectors that the models dynamically create and recall while processing a given input. By analyzing the tokens these vectors represent under this projection, we identify patterns in the information flow inside the attention mechanism. Based on our discoveries, we create a tool that visualizes a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph, with nodes representing neurons or hidden states and edges representing the interactions between them. Our visualization condenses large amounts of data into easy-to-read plots that reflect the models' internal processing and uncover the contribution of each component to the models' final prediction. It also yields new insights into the role of layer norms as semantic filters that influence the models' output, and into neurons that are always activated during forward passes and act as regularization vectors.
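As a rough illustration of the vocabulary projection mentioned above, the sketch below scores a single hidden vector against every row of an unembedding matrix and reports the highest-probability tokens. It is a minimal sketch, not the paper's tool: the function name `project_to_vocab`, the four-word vocabulary, and all numeric values are hypothetical, and a real GPT would typically apply its final layer norm before this step.

```python
import math

def project_to_vocab(hidden, unembedding, vocab, k=3):
    """Return the k (token, probability) pairs that `hidden` projects to most strongly."""
    # One logit per vocabulary entry: dot product of its unembedding row with the hidden vector.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
    # Numerically stable softmax over the vocabulary.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary indices by probability, highest first.
    ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]

# Toy, hypothetical values: a 4-token vocabulary and 3-dimensional hidden states.
vocab = ["cat", "dog", "car", "tree"]
unembedding = [
    [1.0, 0.0, 0.0],  # "cat"
    [0.9, 0.1, 0.0],  # "dog"
    [0.0, 1.0, 0.0],  # "car"
    [0.0, 0.0, 1.0],  # "tree"
]
hidden = [2.0, 0.0, 0.0]

top = project_to_vocab(hidden, unembedding, vocab, k=2)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```

Reading the top-ranked tokens for an attention-head output or a memory value in this way is what allows such vectors to be labelled with human-readable words.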