We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
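The layer-by-layer decoding described above can be illustrated with a minimal sketch. The abstract does not spell out the decoding procedure, so this assumes one common approach, often called the logit lens: each intermediate embedding is normalized and multiplied by the output (unembedding) matrix to obtain next-token probabilities. Everything here is hypothetical toy data: the names `decode_layers`, `EN_ID`, and `TGT_ID`, the orthogonal unembedding matrix, and the hand-built trajectory of hidden states stand in for a real model and are not taken from the paper. The learned gain of the final normalization layer is also omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rms_norm(h, eps=1e-6):
    """RMS normalization without a learned gain (a simplifying assumption)."""
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps)

def decode_layers(hidden_states, unembed):
    """Decode every layer's embedding into next-token probabilities.

    hidden_states: (n_layers, d_model) embeddings of the final prompt token.
    unembed:       (vocab, d_model) output token embeddings.
    Returns an (n_layers, vocab) array of probabilities.
    """
    return softmax(rms_norm(hidden_states) @ unembed.T)

# Toy setup (all values hypothetical): 8 tokens in a 16-dimensional space.
rng = np.random.default_rng(0)
D_MODEL, VOCAB = 16, 8
EN_ID, TGT_ID = 2, 5  # English token and its input-language counterpart

# Orthogonal output embeddings keep the toy example easy to reason about.
q, _ = np.linalg.qr(rng.normal(size=(D_MODEL, VOCAB)))
unembed = 4.0 * q.T
u_en, u_tgt = unembed[EN_ID], unembed[TGT_ID]

# Hand-built trajectory that mimics the three phases: an unrelated start,
# then near the English token, then near the input-language token.
hidden_states = np.stack([
    rng.normal(size=D_MODEL),
    u_en,
    0.7 * u_en + 0.3 * u_tgt,
    0.2 * u_en + 0.8 * u_tgt,
    u_tgt,
])

probs = decode_layers(hidden_states, unembed)
for layer, p in enumerate(probs):
    print(f"layer {layer}: P(en)={p[EN_ID]:.3f}  P(target)={p[TGT_ID]:.3f}")
```

With a real model, `hidden_states` would instead come from the per-layer activations at the last prompt position, and `unembed` from the model's output projection. Comparing the two probability columns across layers is what would reveal whether the English token dominates in the middle layers.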