Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature 'abstractness'.
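To make the randomized-baseline protocol concrete, below is a minimal sketch of the core loop: load a small Pythia checkpoint, overwrite its parameters with i.i.d. Gaussian noise, collect residual-stream activations, and fit a standard ReLU SAE with an L1 sparsity penalty. This assumes the HuggingFace transformers library and PyTorch; the layer index, dictionary width, L1 coefficient, and single-batch training loop are illustrative choices, not the paper's exact settings.

```python
# Minimal sketch of an SAE baseline on a randomized transformer
# (illustrative hyperparameters; not the paper's exact setup).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Load a small Pythia checkpoint, then overwrite every parameter with
# i.i.d. Gaussian noise to form the randomized baseline.
model_name = "EleutherAI/pythia-70m"  # any Pythia size works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name).eval()
with torch.no_grad():
    for p in model.parameters():
        p.normal_(mean=0.0, std=0.02)  # i.i.d. Gaussian randomization

# Collect residual-stream activations from one layer (layer 3 is an
# arbitrary illustrative choice).
texts = ["The quick brown fox jumps over the lazy dog."]
inputs = tokenizer(texts, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[3]
acts = hidden.reshape(-1, hidden.shape[-1])  # (tokens, d_model)

# A standard ReLU SAE: overcomplete dictionary, L1 sparsity penalty.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

sae = SparseAutoencoder(acts.shape[-1], 4 * acts.shape[-1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # illustrative sparsity weight

# Toy training loop on a single activation batch; a real run would
# stream activations over a large corpus and then score the learned
# latents with the same metrics applied to the trained model's SAE.
for step in range(100):
    recon, z = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```

In a full experiment, the same SAE training and evaluation pipeline (reconstruction metrics, auto-interpretability scoring) would be run twice, once on the trained checkpoint and once on this randomized copy, and the comparison of the two score distributions constitutes the baseline check.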