Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch level, allowing a variable number of latents to be active per sample. As a result, BatchTopK adaptively allocates more or fewer latents depending on the sample, improving reconstruction without sacrificing average sparsity. We show that BatchTopK SAEs consistently outperform TopK SAEs in reconstructing activations from GPT-2 Small and Gemma 2 2B, and achieve performance comparable to state-of-the-art JumpReLU SAEs. A further advantage of BatchTopK is that the average number of active latents can be specified directly, rather than approximately tuned through a costly hyperparameter sweep. We provide code for training and evaluating BatchTopK SAEs at https://github.com/bartbussmann/BatchTopK
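The batch-level relaxation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the use of raw pre-activations, and the tie-free assumption are all ours. The idea is to keep the `batch_size * k` largest latent pre-activations across the whole batch, so individual samples may activate more or fewer than `k` latents while the batch average stays exactly `k`.

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Hypothetical sketch of batch-level top-k selection.

    pre_acts: array of shape (batch_size, n_latents) with latent
    pre-activations. Keeps the batch_size * k largest entries across
    the entire batch and zeroes the rest, so per-sample sparsity can
    vary while average sparsity equals k (assuming no ties).
    """
    batch_size = pre_acts.shape[0]
    n_keep = batch_size * k
    flat = pre_acts.ravel()
    # Threshold is the n_keep-th largest value in the whole batch.
    threshold = np.partition(flat, -n_keep)[-n_keep]
    return np.where(pre_acts >= threshold, pre_acts, 0.0)

# With k = 1 and a batch of 2 samples, 2 latents survive in total,
# but both may land in the same sample:
out = batch_topk(np.array([[4.0, 5.0, 2.0],
                           [0.5, 0.1, 1.0]]), k=1)
```

In this toy example the first sample keeps two active latents and the second keeps none, illustrating the adaptive allocation the abstract describes; a per-sample TopK with k = 1 would instead force exactly one latent per sample.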