We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than to single, distinct ones, making it difficult to disentangle and match features across different models. To address this issue, we employ a form of dictionary learning, using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons that correspond to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics to the SAE feature spaces of different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
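To make the pipeline concrete, the following is a minimal NumPy sketch of the two comparison steps: matching feature neurons across models by activation correlation, then scoring the matched spaces with a representational similarity metric (SVCCA is used here as one common choice). The array names, dimensions, and synthetic stand-in activations are illustrative assumptions, not the paper's exact configuration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SAE feature activations for two models over the same
# N tokens: rows are tokens, columns are SAE feature neurons. In
# practice these would come from each model's trained SAE encoder,
# e.g. ReLU(W_enc @ h + b); random data is a stand-in here.
N, F = 10_000, 512
feats_a = np.maximum(rng.normal(size=(N, F)), 0)
feats_b = np.maximum(rng.normal(size=(N, F)), 0)

# Step 1: match feature neurons across models by activation correlation.
# For each feature in model A, find the model-B feature whose activations
# over the shared tokens correlate most strongly with it.
a = (feats_a - feats_a.mean(0)) / (feats_a.std(0) + 1e-8)
b = (feats_b - feats_b.mean(0)) / (feats_b.std(0) + 1e-8)
corr = a.T @ b / N              # (F, F) Pearson correlation matrix
match = corr.argmax(axis=1)     # best model-B partner for each model-A feature

# Step 2: compare the paired feature spaces with a representational
# space similarity metric; SVCCA reduces each space via SVD and then
# computes canonical correlations between the reduced subspaces.
def svcca(x, y, keep=0.99):
    """Mean canonical correlation between the top SVD subspaces of x and y."""
    def top_subspace(m):
        u, s, _ = np.linalg.svd(m - m.mean(0), full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return u[:, :k] * s[:k]  # components explaining `keep` of the variance
    x, y = top_subspace(x), top_subspace(y)
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Canonical correlations are the singular values of Qx^T Qy.
    cc = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return cc.mean()

print(f"SVCCA similarity of matched SAE spaces: {svcca(feats_a, feats_b[:, match]):.3f}")

Correlation-based matching is used because SAE feature neurons carry no shared indexing across independently trained models; aligning columns before applying a subspace metric is what lets the comparison target corresponding features rather than arbitrary coordinates.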