Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery paradigm: auxiliary proxy modules, such as sparse auto-encoders (SAEs), are trained on a model's activations, and the quality of the features they learn is then evaluated. This paradigm naturally raises a critical question: do such learned features actually have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made this comparison systematically so far. In this work, we revisit the interpretability of the feature vectors stored in feed-forward (FF) layers, adopting the view of FF layers as key-value memories, and evaluate them with modern interpretability benchmarks. Our extensive evaluation reveals that SAEs and FF layers exhibit a similar range of interpretability, with SAEs showing an observable but minimal improvement in some aspects. Surprisingly, in certain aspects, even vanilla FF layers yield better interpretability than SAEs, and the features discovered by SAEs and FF layers diverge. These results call into question the advantage of SAEs over directly interpreting FF feature vectors, in terms of both feature quality and faithfulness, and suggest that FF key-value parameters serve as a strong baseline in modern interpretability research.
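To make the key-value view concrete: in that formulation (cf. Geva et al., 2021, "Transformer Feed-Forward Layers Are Key-Value Memories"), a FF block computes FF(x) = f(x W_K^T) W_V, so each row of W_V is a "value" feature vector that can be interpreted directly, for example by projecting it onto the vocabulary through the token embedding matrix. Below is a minimal sketch of that projection for GPT-2; the model, layer, and neuron choices are illustrative assumptions, not taken from the abstract, and this is not the paper's code.

```python
# A minimal sketch: inspect one FF "value" vector in GPT-2 by projecting it
# onto the vocabulary through the (tied) token embedding matrix.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")

layer, neuron = 10, 42                                # hypothetical indices, for illustration
W_V = model.transformer.h[layer].mlp.c_proj.weight   # (d_ff=3072, d_model=768): rows are value vectors
value = W_V[neuron]                                   # one FF value vector, shape (768,)

E = model.transformer.wte.weight                      # (vocab_size, d_model), tied with the LM head
logits = E @ value                                    # score every vocabulary token against this feature
top_ids = torch.topk(logits, k=10).indices
print(tok.convert_ids_to_tokens(top_ids.tolist()))    # tokens this FF feature most promotes
```

Under this view, an SAE decoder direction plays an analogous role to a FF value vector, which is what makes the direct comparison between the two feature families natural.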