We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than to single, distinct ones, making it difficult to disentangle and match features across different models. To address this issue, we employ a form of dictionary learning, using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons that correspond to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics to the SAE feature spaces of different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
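To make the pipeline concrete, the following is a minimal NumPy sketch of the two comparison steps: matching feature neurons across models by activation correlation, then scoring the matched spaces with a representational similarity metric (SVCCA is used here as one common choice). The array names, dimensions, and synthetic stand-in activations are illustrative assumptions, not the paper's exact configuration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SAE feature activations for two models over the same
# N tokens: rows are tokens, columns are SAE feature neurons. In
# practice these would come from each model's trained SAE encoder,
# e.g. ReLU(W_enc @ h + b); random data is a stand-in here.
N, F = 10_000, 512
feats_a = np.maximum(rng.normal(size=(N, F)), 0)
feats_b = np.maximum(rng.normal(size=(N, F)), 0)

# Step 1: match feature neurons across models by activation correlation.
# For each feature in model A, find the model-B feature whose activations
# over the shared tokens correlate most strongly with it.
a = (feats_a - feats_a.mean(0)) / (feats_a.std(0) + 1e-8)
b = (feats_b - feats_b.mean(0)) / (feats_b.std(0) + 1e-8)
corr = a.T @ b / N              # (F, F) Pearson correlation matrix
match = corr.argmax(axis=1)     # best model-B partner for each model-A feature

# Step 2: compare the paired feature spaces with a representational
# space similarity metric; SVCCA reduces each space via SVD and then
# computes canonical correlations between the reduced subspaces.
def svcca(x, y, keep=0.99):
    """Mean canonical correlation between the top SVD subspaces of x and y."""
    def top_subspace(m):
        u, s, _ = np.linalg.svd(m - m.mean(0), full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return u[:, :k] * s[:k]  # components explaining `keep` of the variance
    x, y = top_subspace(x), top_subspace(y)
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Canonical correlations are the singular values of Qx^T Qy.
    cc = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return cc.mean()

print(f"SVCCA similarity of matched SAE spaces: {svcca(feats_a, feats_b[:, match]):.3f}")

Correlation-based matching is used because SAE feature neurons carry no shared indexing across independently trained models; aligning columns before applying a subspace metric is what lets the comparison target corresponding features rather than arbitrary coordinates.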