Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

From arXiv: 42 pages, 43 figures, under review, with extensive supplementary materials. Code and models are available at https://huggingface.co/collections/CausalNLP/multilingual-tinystories-6862b6562414eb84d183f82a, https://huggingface.co/collections/CausalNLP/multilingual-gpt2-models, https://huggingface.co/collections/CausalNLP/multilingual-clts, and https://github.com/abirharrasse/MultilingualCLTs

Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance favor the dominant training language? To address this, we train models on different multilingual mixtures and analyze their internal mechanisms using Cross-Layer Transcoders (CLTs) and attribution graphs. Our results reveal shared multilingual representations: the model employs highly similar features across languages, while language-specific decoding emerges in later layers. Models trained without English show the same shared multilingual space structure. Decoding relies partly on a small set of high-frequency features in the final layers, which linearly encode language identity from early layers. Intervening on these features allows one language to be suppressed and another substituted. Finally, to explain non-English failures, we perform a model-diffing experiment: underperformance arises from weakly activated late-layer features, weak middle-layer clusters, and a tokenizer bias toward English that forces early layers to specialize in word reassembly. Finetuning strengthens these features and their links, improving token assembly and language-specific decoding, which provides a mechanistic explanation for multilingual performance gaps. Our models and CLTs are available at https://huggingface.co/collections/CausalNLP/multilingual-clts and https://huggingface.co/collections/CausalNLP/multilingual-gpt2-models. Our code is available at https://github.com/abirharrasse/MultilingualCLTs
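
The feature intervention described in the abstract (suppressing one language and substituting another) can be illustrated with a toy sketch. This is not the authors' code: the decoder matrix `W_dec`, the feature indices `FEAT_FR` and `FEAT_DE`, and the helper `swap_language_feature` are all hypothetical, and a real CLT writes to several downstream layers rather than through a single decoder as assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 16, 8

# Hypothetical decoder: row i is the direction that feature i writes
# into the model's hidden state (a single decoder, for simplicity).
W_dec = rng.normal(size=(n_features, d_model))

# Hypothetical indices of two language-identity features.
FEAT_FR, FEAT_DE = 2, 5


def swap_language_feature(acts, src, dst):
    """Return a copy of acts with feature src zeroed and its
    activation value moved onto feature dst."""
    patched = acts.copy()
    patched[..., dst] = acts[..., src]
    patched[..., src] = 0.0
    return patched


# Sparse, non-negative feature activations at one token position.
acts = np.zeros(n_features)
acts[FEAT_FR] = 3.0  # the "source language" feature is active
acts[0] = 1.2        # an unrelated feature, left untouched

patched = swap_language_feature(acts, FEAT_FR, FEAT_DE)

# Change in the decoded output caused by the intervention alone.
delta = (patched - acts) @ W_dec
```

In this toy setting, `delta` equals the activation value times the difference between the two decoder directions, so the edit removes the source-language direction and adds the target-language one while leaving every other feature's contribution unchanged.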
