Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.