Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

From arXiv. 42 pages, 43 figures, under review. Extensive supplementary materials. Code and models are available at https://huggingface.co/collections/CausalNLP/multilingual-tinystories-6862b6562414eb84d183f82a, https://huggingface.co/collections/CausalNLP/multilingual-gpt2-models, https://huggingface.co/collections/CausalNLP/multilingual-clts, and https://github.com/abirharrasse/MultilingualCLTs

Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance favor the dominant training language? To address this, we train models on different multilingual mixtures and analyze their internal mechanisms using Cross-Layer Transcoders (CLTs) and Attribution Graphs. Our results reveal multilingual shared representations: the model employs highly similar features across languages, while language-specific decoding emerges in later layers. Training models without English shows identical multilingual shared space structures. Decoding relies partly on a small set of high-frequency features in the final layers, which linearly encode language identity from early layers. Intervening on these features allows one language to be suppressed and another substituted. Finally, to explain non-English failures, we perform a Model-Diffing experiment: underperformance arises from dim late-layer features, weak middle-layer clusters, and tokenizer bias toward English that forces early layers to specialize in word reassembly. Finetuning strengthens these features and their links, improving token assembly and language-specific decoding, providing a mechanistic explanation for multilingual gaps. Our models and CLTs are available at https://huggingface.co/collections/CausalNLP/multilingual-clts and https://huggingface.co/collections/CausalNLP/multilingual-gpt2-models. Our code is available at: https://github.com/abirharrasse/MultilingualCLTs

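To make the abstract's two central ideas concrete, here is a minimal NumPy sketch of a cross-layer transcoder forward pass, followed by a feature intervention that suppresses one late-layer feature and substitutes another. It is an illustration under stated assumptions, not the authors' implementation: the function `clt_forward`, the random weights, the plain ReLU activation, and the feature indices `lang_a` and `lang_b` are all hypothetical. It follows the general CLT formulation in which features read from the residual stream at one layer and write to the MLP outputs of that layer and every later one.

```python
import numpy as np


def clt_forward(resid, W_enc, W_dec, patch=None):
    """Toy cross-layer transcoder (hypothetical helper, not the paper's code).

    resid: list of L residual-stream vectors, each of shape (d_model,)
    W_enc: list of L encoder matrices, each of shape (n_feat, d_model)
    W_dec: dict mapping (src_layer, dst_layer) -> (d_model, n_feat) decoder,
           defined only for src_layer <= dst_layer
    patch: optional dict mapping (layer, feature) -> forced activation value
    Returns (acts, mlp_out): per-layer feature activations and the
    reconstructed MLP output for each layer.
    """
    n_layers = len(resid)
    # Assumption: plain ReLU encoder; real CLTs may use other activations.
    acts = [np.maximum(W_enc[l] @ resid[l], 0.0) for l in range(n_layers)]
    if patch:
        for (layer, feat), value in patch.items():
            acts[layer][feat] = value
    # Each layer's MLP output is rebuilt from features of all layers up to it.
    mlp_out = [
        sum(W_dec[(src, dst)] @ acts[src] for src in range(dst + 1))
        for dst in range(n_layers)
    ]
    return acts, mlp_out


rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 3, 8, 16
resid = [rng.normal(size=d_model) for _ in range(n_layers)]
W_enc = [rng.normal(size=(n_feat, d_model)) for _ in range(n_layers)]
W_dec = {
    (s, d): rng.normal(size=(d_model, n_feat))
    for s in range(n_layers)
    for d in range(s, n_layers)
}

acts, base_out = clt_forward(resid, W_enc, W_dec)

# Hypothetical "language" features in the final layer: turn one off and
# force the other on, mimicking suppress-and-substitute.
last = n_layers - 1
lang_a, lang_b = 2, 5
forced = 5.0
patch = {(last, lang_a): 0.0, (last, lang_b): forced}
_, patched_out = clt_forward(resid, W_enc, W_dec, patch)
```

Because the patched features live in the final layer, only that layer's reconstructed MLP output changes in this toy setup, and the change is a linear combination of the two features' decoder columns. In a real model the residual stream would also need to be recomputed after the intervention, which this sketch deliberately omits by holding `resid` fixed.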