Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to distinguish architectural improvements, and existing synthetic-data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics, including correlation, hierarchy, and superposition, along with a standardized benchmark model, SynthSAEBench-16k, that enables direct comparison of SAE architectures. Our benchmark reproduces several phenomena previously observed in LLM SAEs, including the disconnect between reconstruction and latent-quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use the benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.
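To make the data-generation recipe concrete, here is a minimal sketch of the kind of generator the abstract describes: sparse feature activations with correlation and hierarchy, projected through a random overcomplete dictionary to induce superposition. All names and parameters here (`n_features`, `d_model`, the shared-factor correlation scheme, the parent-child hierarchy rule) are illustrative assumptions, not SynthSAEBench's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model, n_samples = 64, 16, 10_000  # more features than dims -> superposition

# Correlated firing: every feature loads on a shared latent factor, then thresholds.
shared = rng.standard_normal((n_samples, 1))
noise = rng.standard_normal((n_samples, n_features))
active = (0.4 * shared + noise) > 2.0  # sparse, positively correlated binary activations

# Hierarchy: odd-indexed "child" features may fire only when their even parent fires.
parents = np.arange(0, n_features, 2)
active[:, parents + 1] &= active[:, parents]

# Feature magnitudes plus a random unit-norm overcomplete decoder.
acts = active * rng.uniform(0.5, 1.5, size=active.shape)
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

X = acts @ W_dec  # "model activations": 64 features squeezed into 16 dimensions
print(X.shape, active.mean())  # dataset shape and average firing rate
```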
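A hedged sketch of the matching-pursuit encoding step behind the reported failure mode may also help: because the greedy update chases whatever residual remains, it can spend latents on superposition interference rather than on ground-truth features. The function signature and stopping rule below are assumptions for illustration, not the evaluated SAE's implementation.

```python
import numpy as np

def matching_pursuit(x, W_dec, k):
    """Greedily encode x with k atoms of W_dec (rows assumed unit-norm).
    Each step picks the atom most correlated with the current residual,
    so the encoder can absorb superposition interference into extra
    latents: reconstruction improves, feature recovery does not."""
    residual = x.astype(float).copy()
    codes = np.zeros(W_dec.shape[0])
    for _ in range(k):
        scores = W_dec @ residual          # correlation of every atom with the residual
        j = int(np.argmax(np.abs(scores)))
        codes[j] += scores[j]
        residual -= scores[j] * W_dec[j]   # remove the chosen atom's contribution
    return codes, residual

# e.g., codes, res = matching_pursuit(X[0], W_dec, k=8) with the generator above
```

Run with `k` larger than the true number of active features, this sketch drives the residual down while attributing mass to atoms that never fired, which is the overfitting pattern the abstract flags.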