Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, where they are used to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level: they enforce the same number of active features, K, for every input, regardless of how complex that input is. Natural data often lies on manifolds with varying local intrinsic dimensionality, so the number of relevant factors can change significantly across samples. A fixed sparsity level is therefore unlikely to be optimal. Simple inputs may require only a few features, while complex ones need more expressive representations; a constant K can introduce noise in the former case and miss important structure in the latter. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k, which lets the model adjust the number of active features to the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features but also selects an appropriate number of features for each concept. The source code is available at: https://anonymous.4open.science/r/SoftSAE-8F71/.
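To make the contrast between a fixed K and an input-dependent k concrete, the following is a minimal NumPy sketch, not the SoftSAE implementation. The abstract does not specify the Soft Top-K operator, so the sketch assumes one common relaxation: a sigmoid mask around a per-input threshold, with the threshold chosen by bisection so that the mask sums to k. The function names `hard_topk` and `soft_topk_mask`, the `temperature` parameter, and the choice to pass k in directly (rather than predict it from the input with a learned head) are all illustrative assumptions.

```python
import numpy as np

def hard_topk(z, k):
    """Fixed-K baseline: keep the k largest entries in each row, zero the rest."""
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def _sigmoid(x):
    # tanh form avoids overflow warnings for large |x|
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def soft_topk_mask(z, k, temperature=0.1, iters=60):
    """Assumed relaxation of top-k (not necessarily the paper's operator).

    Returns mask = sigmoid((z - tau) / temperature), where the per-row
    threshold tau is found by bisection so that each row's mask sums to
    its own, possibly fractional, k with 0 < k < z.shape[-1].
    """
    z = np.asarray(z, dtype=float)
    k = np.broadcast_to(np.asarray(k, dtype=float), z.shape[:-1])[..., None]
    # Bracket tau: at lo the mask is ~all ones, at hi it is ~all zeros.
    lo = z.min(axis=-1, keepdims=True) - 50.0 * temperature
    hi = z.max(axis=-1, keepdims=True) + 50.0 * temperature
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = _sigmoid((z - tau) / temperature).sum(axis=-1, keepdims=True)
        too_many = total > k          # mask too large -> raise the threshold
        lo = np.where(too_many, tau, lo)
        hi = np.where(too_many, hi, tau)
    tau = 0.5 * (lo + hi)
    return _sigmoid((z - tau) / temperature)

# Two inputs with different sparsity budgets: k = 2 for the first, k = 5 for the second.
rng = np.random.default_rng(0)
z = rng.normal(size=(2, 8))           # stand-in for encoder pre-activations
k = np.array([2.0, 5.0])              # in SoftSAE this would be learned per input
mask = soft_topk_mask(z, k)
codes = mask * z                      # softly gated feature activations
print(mask.sum(axis=-1))              # close to [2, 5] by construction
```

Because the mask is a smooth function of the scores, an autodiff framework could in principle pass gradients to whatever module produces k, which a hard top-k selection does not allow. NumPy is used here only to show the forward computation.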