Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame this token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most $k$ features. In TopK SAEs, the constraint of $k$ active features per token is identical across tokens, even though some tokens are more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, each of which allows a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most $m$ tokens. Mutual Choice SAEs solve the unrestricted allocation problem, in which the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, $\mathtt{aux\_zipf\_loss}$, which generalises the $\mathtt{aux\_k\_loss}$ to mitigate dead and underutilised features. Owing to their inherent adaptive computation, our methods yield SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.
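The three allocation schemes can be contrasted in a minimal sketch. Assuming a (tokens × features) matrix of pre-activation scores, TopK caps the number of active features per row, Feature Choice caps the number of active tokens per column, and Mutual Choice spends a single global budget on the highest-scoring (token, feature) pairs. The function names, shapes, and NumPy masking below are illustrative, not the authors' implementation:

```python
import numpy as np

def topk_mask(scores, k):
    # TopK: each token (row) keeps its k highest-scoring features,
    # regardless of how hard that token is to reconstruct.
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def feature_choice_mask(scores, m):
    # Feature Choice: each feature (column) keeps its m highest-scoring
    # tokens, so harder tokens can end up with more active features.
    idx = np.argpartition(scores, -m, axis=0)[-m:, :]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=0)
    return mask

def mutual_choice_mask(scores, budget):
    # Mutual Choice: keep the `budget` highest-scoring (token, feature)
    # pairs globally, with no per-row or per-column cap.
    flat = scores.ravel()
    idx = np.argpartition(flat, -budget)[-budget:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 8))  # 4 tokens, 8 features

print(topk_mask(scores, k=2).sum(axis=1))            # always k per token
print(feature_choice_mask(scores, m=1).sum(axis=0))  # always m per token-set of each feature
print(mutual_choice_mask(scores, budget=8).sum())    # budget pairs in total
```

Note that under Feature Choice and Mutual Choice the per-token row sums of the mask can vary, which is exactly the adaptive per-token sparsity the abstract describes.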