Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in their pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the presence of code-switching, alternating between different languages within a single context, is key to multilingual capabilities. We analyze code-switching in the pre-training corpus, measuring its prevalence and categorizing it into four types across two quadrants, and then assess its impact on multilingual performance. These types of code-switching data are unevenly distributed and differ in how effectively they facilitate language transfer. To better harness code-switching for language alignment during pre-training, we investigate a synthetic code-switching strategy. As we progressively scale up the synthetic code-switching data, we observe marked improvements both on benchmarks and in the representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high-, medium-, and low-resource languages with pre-training corpora of varying quality.
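To make the notion of synthetic code-switching concrete, the sketch below shows one minimal way such data could be constructed: replacing a fraction of words in a monolingual sentence with counterparts from another language. This is an illustrative assumption, not the paper's actual pipeline; the toy English–German lexicon, the `synthesize_code_switch` function, and the replacement ratio are all hypothetical stand-ins for a real translation model or aligned dictionary.

```python
import random

# Toy bilingual lexicon used purely for illustration; a real pipeline would
# rely on a translation model or an aligned dictionary (assumption, not the
# paper's actual resource).
EN_TO_DE = {
    "language": "Sprache",
    "model": "Modell",
    "data": "Daten",
    "training": "Training",
}

def synthesize_code_switch(sentence: str, lexicon: dict, ratio: float = 0.3,
                           seed: int = 0) -> str:
    """Replace a fraction of translatable tokens with their counterparts in
    another language, yielding a synthetic code-switched sentence."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        key = tok.lower().strip(".,")
        # Swap the token only if it is in the lexicon and the sampled
        # probability falls under the target mixing ratio.
        if key in lexicon and rng.random() < ratio:
            out.append(lexicon[key])
        else:
            out.append(tok)
    return " ".join(out)

if __name__ == "__main__":
    src = "The language model improves with more training data"
    print(synthesize_code_switch(src, EN_TO_DE, ratio=0.5))
```

Word-level substitution is only one of the flavors the paper's two-quadrant taxonomy covers; segment- or sentence-level mixing would follow the same pattern with larger replacement units.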