Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
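The bottleneck and the steering procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity selection rule, the function names `quantize` and `steer`, the toy sizes, and the choice to swap out the weakest active code are all assumptions made for illustration.

```python
import numpy as np


def quantize(hidden, codebook, k):
    """Replace `hidden` with the sum of its k most similar codebook vectors."""
    # Cosine similarity between the hidden vector and every code
    # (assumed selection rule; the abstract does not specify one).
    norms = np.linalg.norm(codebook, axis=1) * np.linalg.norm(hidden)
    sims = (codebook @ hidden) / np.maximum(norms, 1e-12)
    # Indices of the k highest-scoring codes, best first.
    active = np.argsort(-sims)[:k]
    return codebook[active].sum(axis=0), active


def steer(codebook, active, forced_code):
    """Force a chosen code on by swapping it in for the weakest active code."""
    steered = np.array(active)  # copy, so the caller's indices are untouched
    if forced_code not in steered:
        steered[-1] = forced_code  # hypothetical strategy: drop the weakest code
    return codebook[steered].sum(axis=0), steered


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # 16 codes of dimension 4
hidden = rng.normal(size=4)          # one continuous hidden state

quantized, active = quantize(hidden, codebook, k=3)
print("active codes:", active)
print("quantized hidden state:", quantized)
```

In this sketch the downstream layers would see only `quantized`, so the set of `active` indices is a sparse, discrete description of the hidden state. Calling `steer(codebook, active, forced_code)` with the index of a code associated with a target behavior mimics the intervention described above.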