A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that a given model uses in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data, which is both laborious to acquire and computationally expensive to use. In this work, we introduce a novel method, called "observable propagation" (ObsProp for short), for finding the linear features that transformer language models use in computing a given task, using almost no data. Our paradigm centers on the concept of observables: linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors, and we introduce a similarity metric between feature vectors, called the coupling coefficient, which estimates the degree to which one feature's output correlates with another's. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at github.com/jacobdunefsky/ObservablePropagation.
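To make the notion of an observable concrete, the following is a minimal sketch, not the ObsProp implementation. It assumes a toy four-token vocabulary and a single hypothetical linear unembedding matrix `W_U`, and it ignores attention, MLPs, and LayerNorm. The observable is the linear functional "logit of `she` minus logit of `he`". Because the layer is linear, pulling the observable back through it (multiplying by the transpose) gives a direction in hidden space whose dot product with any hidden state equals the observable's value on the resulting logits. All names and numbers are illustrative assumptions.

```python
# Toy illustration only: the vocabulary, matrix sizes, and weights are hypothetical.
VOCAB = ["he", "she", "the", "doctor"]


def observable(pos_token, neg_token, vocab=VOCAB):
    """Linear functional on logits: logit(pos_token) - logit(neg_token)."""
    n = [0.0] * len(vocab)
    n[vocab.index(pos_token)] = 1.0
    n[vocab.index(neg_token)] = -1.0
    return n


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def apply_linear(W, x):
    """Compute W x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in W]


def pull_back(W, n):
    """Compute W^T n, so that dot(n, W x) == dot(W^T n, x) for every x."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * n[i] for i in range(rows)) for j in range(cols)]


# Hypothetical unembedding matrix: 4 vocabulary rows, hidden size 3.
W_U = [
    [0.5, -1.0, 0.0],    # "he"
    [-0.5, 1.0, 0.25],   # "she"
    [0.0, 0.0, 1.0],     # "the"
    [1.0, 0.5, -0.5],    # "doctor"
]

n = observable("she", "he")
feature = pull_back(W_U, n)      # direction in hidden space for this observable
x = [2.0, 1.0, -4.0]             # an arbitrary hidden state
logits = apply_linear(W_U, x)

# The observable evaluated on the logits matches the feature applied to x.
print(dot(n, logits), dot(feature, x))
```

In this toy setting the feature direction is the difference between the `she` and `he` rows of `W_U`, and no data is needed to obtain it. Handling the nonlinear components of a real transformer is the part addressed by the method itself and is outside the scope of this sketch.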