Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
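The first-letter evaluation and the absorption failure mode described above can be sketched in a toy example. Everything here is synthetic and purely illustrative (the vocabulary, activation values, and `first_letter_recall` helper are invented for this sketch, not the paper's actual pipeline): a latent that looks like a "starts with S" detector fails to fire on one s-token whose feature has been absorbed into a token-specific latent, and ground-truth first-letter labels let us quantify that gap.

```python
# Toy sketch of the first-letter evaluation: ground-truth labels
# (does this token start with "s"?) are known for every vocabulary token,
# and we check whether a candidate "starts with S" SAE latent fires on each.
# Activations here are synthetic; in practice they would come from an SAE
# run on the model's activations for each token.

vocab = ["sun", "snake", "short", "cat", "dog", "stone"]

# Hypothetical latent activations: the latent fires on most s-tokens, but
# "short" has been absorbed into a token-specific latent, so the general
# "starts with S" latent stays silent on it even though it clearly should fire.
latent_activation = {
    "sun": 3.1, "snake": 2.7, "short": 0.0,
    "cat": 0.0, "dog": 0.0, "stone": 2.9,
}

def first_letter_recall(letter, activations, threshold=0.0):
    """Fraction of tokens with the given first letter on which the latent fires."""
    positives = [t for t in vocab if t.startswith(letter)]
    fired = [t for t in positives if activations[t] > threshold]
    return len(fired) / len(positives)

recall = first_letter_recall("s", latent_activation)
print(f"recall on s-tokens: {recall:.2f}")  # "short" is absorbed, so recall < 1
```

Because the labels are exhaustive over the vocabulary, absorption shows up directly as missing recall on tokens the latent "should" cover, rather than requiring manual inspection of activation examples.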