Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and to extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present ClassifSAE, a novel SAE-based model tailored for text classification that leverages a specialized classifier head and incorporates an activation-rate sparsity loss. We benchmark this architecture against established methods, including ConceptShap, Independent Component Analysis, HI-Concept, and a standard TopK-SAE baseline. Our evaluation covers several classification benchmarks and backbone LLMs. We further enrich our analysis with two novel metrics that measure the precision of concept-based explanations using an external sentence encoder. Our empirical results show that ClassifSAE improves both the causality and the interpretability of the extracted features.