Sparse Autoencoders (SAEs) show promise for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a valuable tool for building transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and bag-of-words (BoW) baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features also generalize zero-shot to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
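To make the pooling and binarization steps concrete, here is a minimal sketch of how token-level SAE activations might be reduced to a single feature vector for a downstream classifier. The function name `pool_and_binarize`, the set of pooling options, and the example threshold are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def pool_and_binarize(acts: np.ndarray, pooling: str = "max",
                      threshold: float = 0.0, binarize: bool = True) -> np.ndarray:
    """Reduce token-level SAE activations (shape T x F) to one F-dim feature vector.

    `pooling` aggregates over the token axis; if `binarize` is set, each pooled
    feature is mapped to {0, 1} by comparison against `threshold`.
    """
    if pooling == "mean":
        pooled = acts.mean(axis=0)      # average activation per SAE feature
    elif pooling == "max":
        pooled = acts.max(axis=0)       # strongest activation per SAE feature
    elif pooling == "last":
        pooled = acts[-1]               # activations at the final token only
    else:
        raise ValueError(f"unknown pooling strategy: {pooling}")
    return (pooled > threshold).astype(np.float32) if binarize else pooled

# Example: 12 tokens through a hypothetical 16k-wide SAE; the binarized,
# max-pooled vector would then feed a linear probe for classification.
acts = np.random.rand(12, 16_384)
features = pool_and_binarize(acts, pooling="max", threshold=0.1)
```

Binarization discards magnitude and keeps only which features fire, which is why it can act as an implicit, inexpensive form of feature selection.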