Large language models (LLMs) have shown significant multilingual capabilities. However, the mechanisms by which these capabilities develop during pre-training are not well understood. In this paper, we use code LLMs as an experimental platform to explore the evolution of multilingual capabilities during pre-training. Based on our observations, we propose the Babel Tower Hypothesis, which describes the entire process by which LLMs acquire new language capabilities: during learning, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. We then validate this hypothesis by tracking the internal states of LLMs, identifying working languages and language transferring neurons. Experimental results show that the internal state changes of the LLM are consistent with the Babel Tower Hypothesis. Building on these insights, we propose a novel method for constructing an optimized pre-training corpus for multilingual code LLMs; models trained on this corpus significantly outperform those trained on the original corpus. The Babel Tower Hypothesis provides new insights into designing pre-training data distributions to achieve optimal multilingual capabilities in LLMs.