A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation that is disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. In doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language, and vice versa, through activation patching alone. Second, we show that patching with the mean of latents across different languages does not impair, but instead improves, the models' performance in translating the concept. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
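To make the extract-and-insert procedure concrete, the sketch below shows one way such an activation-patching experiment could be set up with PyTorch forward hooks on a HuggingFace causal LM. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the model name, layer index, prompts, and hook placement are all hypothetical choices.

```python
# A minimal activation-patching sketch, assuming a Llama-style HuggingFace
# causal LM whose decoder blocks live at model.model.layers. The model name,
# layer index, and prompts below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
LAYER = 15                               # hypothetical patching layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

source_prompt = 'Français: "fleur" - Deutsch: "'  # source translation prompt
target_prompt = 'English: "book" - Italiano: "'   # target translation prompt

captured = {}

def capture_hook(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["latent"] = hidden[:, -1, :].detach().clone()  # last-token latent

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] = captured["latent"]  # overwrite the last-token latent in place
    return output

block = model.model.layers[LAYER]

# 1) Run the source prompt and capture its latent at the chosen layer.
handle = block.register_forward_hook(capture_hook)
with torch.no_grad():
    model(**tok(source_prompt, return_tensors="pt"))
handle.remove()

# 2) Run the target prompt with the source latent patched in, then read off
#    the model's next-token prediction.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tok(target_prompt, return_tensors="pt")).logits
handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))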