Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement, and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs spanning eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at https://saebench.xyz.