Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement, and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: https://saebench.xyz