Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.