Neuron to Graph: Interpreting Language Model Neurons at Scale

Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces an automated approach that scales interpretability techniques across the large number of neurons within LLMs, with the aim of making these models more interpretable and ultimately safe. Conventional methods require examining examples that strongly activate a neuron and manually identifying patterns to decipher the concepts the neuron responds to. We propose Neuron to Graph (N2G), a tool that automatically extracts a neuron's behaviour from the dataset the model was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to retain only the tokens most pertinent to a neuron, and enriches the dataset examples with diverse samples to better cover the full spectrum of the neuron's behaviour. The resulting graphs can be visualised to aid researchers' manual interpretation. They can also generate token activations on text, which allows automatic validation by comparison with the neuron's ground truth activations; we use this comparison to show that N2G graphs predict neuron activations better than two baseline methods. We also demonstrate how the graph representations can be used to facilitate further automation of interpretability research, by searching for neurons with particular properties or by programmatically comparing neurons to identify similar ones. Our method scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, which makes it widely usable. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph.
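To make the truncation and saliency steps more concrete, the following is a minimal sketch under stated assumptions, not the implementation in the released code. The toy neuron, the `<pad>` mask token, the `keep_fraction` threshold, and all function names are hypothetical. A real setup would replace `toy_neuron` with a function that runs the model and reads the neuron's activation on the final token.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
ActivationFn = Callable[[Tokens], float]

PAD = "<pad>"  # hypothetical mask token used for the saliency step


def toy_neuron(tokens: Tokens) -> float:
    """Stand-in for a real neuron (assumption): activation on the final token.

    Fires strongly on a final "bank" when "river" appears earlier in the
    context, weakly on "bank" alone, and not at all otherwise.
    """
    if not tokens or tokens[-1] != "bank":
        return 0.0
    return 1.0 if "river" in tokens[:-1] else 0.2


def prune(tokens: Tokens, activation_fn: ActivationFn,
          keep_fraction: float = 0.5) -> Tokens:
    """Truncation: shortest suffix ending at the final token that keeps at
    least `keep_fraction` of the original activation."""
    original = activation_fn(tokens)
    # Grow the suffix one token at a time, starting from the final token.
    for start in range(len(tokens) - 1, -1, -1):
        suffix = tokens[start:]
        if activation_fn(suffix) >= keep_fraction * original:
            return suffix
    return tokens


def saliency(tokens: Tokens,
             activation_fn: ActivationFn) -> List[Tuple[str, float]]:
    """Saliency: drop in activation when each context token is masked."""
    original = activation_fn(tokens)
    scores = []
    for i, tok in enumerate(tokens[:-1]):  # the final token is never masked
        masked = tokens[:i] + [PAD] + tokens[i + 1:]
        scores.append((tok, original - activation_fn(masked)))
    return scores


context = ["we", "sat", "by", "the", "river", "near", "the", "bank"]
pruned = prune(context, toy_neuron)
important = saliency(pruned, toy_neuron)
```

In this toy setting, truncation drops the leading tokens that do not affect the activation, and the saliency scores single out "river" as the only context token whose removal reduces it. Tokens flagged this way are the kind of information a graph representation of the neuron would need to record.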
