We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, we use carefully constructed non-English prompts, each with a unique correct single-token continuation. From layer to layer, transformers gradually map the input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases: intermediate embeddings (1) start far away from output token embeddings; (2) already allow a semantically correct next token to be decoded in the middle layers, but assign higher probability to its English version than to its version in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model in which the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that this abstract "concept space" lies closer to English than to other languages, which may have important consequences for the biases held by multilingual language models.
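The layer-by-layer tracking described above resembles the logit-lens technique: every intermediate hidden state of the final prompt token is projected onto the output token embeddings to read off a next-token distribution. Below is a minimal sketch of that idea, not the authors' released code: it assumes access to the gated `meta-llama/Llama-2-7b-hf` checkpoint via the HuggingFace `transformers` library, uses an illustrative French-to-German translation prompt in the style described above (not one of the paper's actual prompts), and applies the model's final RMSNorm before unembedding, which is one common logit-lens variant.

```python
# Minimal logit-lens-style sketch (an assumed setup, not the authors' code):
# decode a next-token distribution from every layer's hidden state
# of the final prompt token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; access must be granted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # assumes a GPU and the `accelerate` package
)
model.eval()

# Illustrative translation prompt: the semantically correct continuation is
# the German "Blume"; its English version is "flower".
prompt = 'Français: "fleur" - Deutsch: "'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding-layer output plus one tensor per block.
for layer, h in enumerate(out.hidden_states):
    h_last = h[0, -1]                  # intermediate embedding of the final prompt token
    h_norm = model.model.norm(h_last)  # final RMSNorm, applied before unembedding
    logits = model.lm_head(h_norm)     # project onto the output token embeddings
    probs = torch.softmax(logits.float(), dim=-1)
    top_p, top_id = probs.max(dim=-1)
    print(f"layer {layer:2d}: {tokenizer.decode(top_id.item())!r}  p={top_p.item():.3f}")
```

Under the three-phase picture sketched in the abstract, one would expect early layers to print unrelated tokens, middle layers to rank "flower" highest, and only the final layers to settle on the input-language-appropriate continuation.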