We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are also useful for downstream tasks: we introduce SHIFT, a method that improves the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
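To make the ablation step behind SHIFT concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `ablate_features`, the ReLU encoder form, and the identity dictionary in the toy example are all assumptions chosen for illustration. The sketch encodes an activation into sparse features, zeroes the features flagged as task-irrelevant, and subtracts only their decoded contribution, so whatever the dictionary fails to reconstruct is left unchanged.

```python
import numpy as np

def ablate_features(x, W_enc, b_enc, W_dec, ablate_idx):
    """Zero-ablate selected dictionary features of an activation (hypothetical helper).

    x:      (d_model,) model activation
    W_enc:  (n_feat, d_model) encoder weights of an assumed sparse dictionary
    b_enc:  (n_feat,) encoder bias
    W_dec:  (d_model, n_feat) decoder weights
    ablate_idx: indices of features judged task-irrelevant
    """
    # Sparse feature activations (assumed ReLU encoder).
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    # Copy the features and zero out the ones selected for ablation.
    f_ablated = f.copy()
    f_ablated[list(ablate_idx)] = 0.0
    # Remove only the decoded contribution of the ablated features,
    # leaving the dictionary's reconstruction error in place.
    return x - W_dec @ (f - f_ablated)

# Toy example: identity dictionary, so each feature is one coordinate (assumption).
W_enc = np.eye(3)
b_enc = np.zeros(3)
W_dec = np.eye(3)
x = np.array([2.0, 0.0, 1.5])

# A linear classifier that leans on feature 2, which we pretend is spurious.
w_clf = np.array([1.0, 0.0, 1.0])

x_clean = ablate_features(x, W_enc, b_enc, W_dec, ablate_idx=[2])
logit_before = float(w_clf @ x)
logit_after = float(w_clf @ x_clean)
```

In this toy setting the classifier's logit no longer depends on the ablated feature. In practice the features would come from a trained sparse dictionary, and the choice of which indices to ablate would come from a human inspecting the circuit.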