Cross-lingual transfer learning is an important property of multilingual large language models (LLMs). But how do LLMs represent relationships between languages? Every language model has an input layer that maps tokens to vectors, yet this ubiquitous layer is often overlooked. We find that similarities between these input embeddings are highly interpretable and that the geometry of the embeddings differs between model families. In one family (XLM-RoBERTa), embeddings encode language: tokens in different writing systems can be linearly separated with an average accuracy of 99.2%. Another family (mT5) represents cross-lingual semantic similarity: the 50 nearest neighbors of any token span an average of 7.61 writing systems and are frequently translations. This result is surprising given that there are no explicit parallel cross-lingual training corpora and no explicit incentive for translation in the pre-training objectives. Our research opens the door to investigations into 1) the effect of pre-training and model architectures on representations of languages and 2) the applications of cross-lingual representations embedded in language models.
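The neighbor statistic mentioned in the abstract (how many writing systems appear among a token's nearest neighbors) can be illustrated with a minimal Python sketch. It makes several assumptions that are not from the paper: the vectors are toy, hand-made values rather than real mT5 embeddings, cosine similarity is used as the similarity measure, the helper names `script_of`, `nearest_neighbors`, and `mean_scripts_per_neighborhood` are hypothetical, and the first word of a character's Unicode name serves as a rough proxy for its writing system.

```python
import math
import unicodedata


def script_of(token):
    """Rough writing-system label (assumption): the first word of the
    Unicode name of the token's first alphabetic character."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(token, embeddings, k):
    """Return the k tokens most similar to `token`, excluding itself."""
    query = embeddings[token]
    others = [t for t in embeddings if t != token]
    others.sort(key=lambda t: cosine(query, embeddings[t]), reverse=True)
    return others[:k]


def mean_scripts_per_neighborhood(embeddings, k):
    """Average number of distinct writing systems among each token's
    k nearest neighbors."""
    counts = [
        len({script_of(n) for n in nearest_neighbors(t, embeddings, k)})
        for t in embeddings
    ]
    return sum(counts) / len(counts)


# Toy vectors for illustration only (NOT real model embeddings):
# two semantic clusters, each containing three writing systems.
toy = {
    "water": [1.0, 0.1],
    "вода": [0.9, 0.2],
    "水": [0.95, 0.15],
    "fire": [0.1, 1.0],
    "огонь": [0.2, 0.9],
    "火": [0.15, 0.95],
}

if __name__ == "__main__":
    print(nearest_neighbors("water", toy, k=2))
    print(mean_scripts_per_neighborhood(toy, k=2))
```

With real embeddings, the `toy` dictionary would be replaced by a mapping from vocabulary tokens to rows of a model's input embedding matrix, and `k` would be set to 50 to mirror the setting described in the abstract.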