Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, since it reflects the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the base LLM's activations. Overall, our results suggest that SAE concept representations are fragile and, without further denoising or post-processing, may be ill-suited for applications in model monitoring and oversight.
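To make the input-space optimization concrete, the following is a minimal PyTorch-style sketch (not the paper's actual implementation) of crafting an embedding-space perturbation that suppresses one targeted SAE feature while penalizing drift in the base LLM's activations. The names `llm_hidden` (returns the LLM's residual-stream activations) and `sae_encode` (maps activations to SAE feature activations) are hypothetical placeholders introduced for illustration:

```python
import torch

def craft_perturbation(embeds, llm_hidden, sae_encode, feature_idx,
                       eps=0.01, steps=100, lr=1e-3, lam=10.0):
    """Sketch: find a small delta on input embeddings that suppresses the
    targeted SAE feature while keeping LLM activations near their clean value.
    llm_hidden and sae_encode are assumed callables, not a real library API."""
    clean_act = llm_hidden(embeds).detach()           # clean LLM activations
    delta = torch.zeros_like(embeds, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        act = llm_hidden(embeds + delta)              # perturbed activations
        concept = sae_encode(act)[..., feature_idx]   # targeted SAE feature
        # Minimize the concept activation; penalize activation drift so the
        # base LLM's representation stays (approximately) unchanged.
        loss = concept.sum() + lam * (act - clean_act).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                         # keep ||delta||_inf <= eps
            delta.clamp_(-eps, eps)
    return delta.detach()
```

Other scenarios from the framework (e.g., amplifying rather than suppressing a feature) would follow the same pattern with the sign of the concept term flipped; the trade-off weight `lam` and budget `eps` are illustrative defaults, not reported values.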