Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and output language, and they can be used for language identification with accuracy comparable to fastText while offering greater interpretability. Our code and complete figures are available at https://github.com/LyzanderAndrylie/language-specific-features
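The intuition behind an activation-probability score can be sketched as follows: a feature whose activation probability is concentrated on a single language is language-specific, whereas one that fires uniformly across languages is language-independent. The snippet below is a minimal, hypothetical illustration of a LAPE-style entropy score over per-language activation probabilities; the exact SAE-LAPE formulation is defined in the paper, not here.

```python
import numpy as np

def language_specificity(act_probs):
    """Hypothetical LAPE-style score: entropy of a feature's normalized
    per-language activation probabilities. Lower entropy means the feature
    fires mostly for one language, i.e. it is more language-specific.

    act_probs: sequence of P(feature active | language) for each language.
    """
    p = np.asarray(act_probs, dtype=float)
    p = p / p.sum()                          # normalize to a distribution
    return float(-np.sum(p * np.log(p + 1e-12)))  # Shannon entropy

# A feature firing almost exclusively for one language scores lower
# (more specific) than one firing equally across three languages.
print(language_specificity([0.9, 0.01, 0.01]) <
      language_specificity([0.3, 0.3, 0.3]))  # prints True
```

Ranking features by such a score (lowest entropy first) would surface candidate language-specific features for inspection.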