Sparse autoencoders (SAEs) have attracted considerable attention as a promising tool for improving the interpretability of large language models (LLMs): they map the complex superposition of polysemantic neurons onto monosemantic features, composing a sparse dictionary of words. However, traditional performance metrics such as Mean Squared Error (MSE) and L0 sparsity do not evaluate the semantic representational power of SAEs -- whether they acquire interpretable monosemantic features while preserving the semantic relationships among words. For instance, it is not obvious whether a learned sparse feature can distinguish the different meanings of a single word. In this paper, we propose a suite of evaluations for SAEs that analyzes the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may be misleading with respect to interpretability: a better frontier does not necessarily enhance the extraction of monosemantic features. Analyzing SAEs through polysemous words also sheds light on the internal mechanisms of LLMs: deeper layers and the Attention module contribute to distinguishing polysemy within a word. Our semantics-focused evaluation offers new insights into polysemy and the existing SAE objective, and contributes to the development of more practical SAEs.
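For concreteness, the sketch below shows how the two traditional metrics the abstract refers to, MSE and L0 sparsity, are typically computed for a standard ReLU SAE. This is a minimal illustration, not the authors' implementation; the class name, dimensions, and random inputs are all assumptions for the example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: maps d_model-dim activations to an overcomplete sparse code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activations
        return x_hat, f

# Illustrative batch of LLM residual-stream activations (random for this sketch).
x = torch.randn(32, 512)
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x_hat, f = sae(x)

mse = ((x - x_hat) ** 2).mean()            # reconstruction error (MSE)
l0 = (f > 0).float().sum(dim=-1).mean()    # avg. number of active features per input (L0)
print(f"MSE={mse.item():.4f}, L0={l0.item():.1f}")
```

Note that neither quantity inspects what the active features mean; two SAEs with identical MSE-L0 trade-offs can still differ in whether a feature separates the senses of a polysemous word, which is the gap the proposed evaluation targets.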