Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target-language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages, resources that are unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations. The method requires no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
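The substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `substitute_words` and `apply_to_corpus`, the `corpus_fraction` parameter, the regex-based word splitting, and the choice to apply the replacement ratio only to words covered by the vocabulary are all assumptions made for illustration. The toy English-to-Spanish vocabulary is likewise illustrative.

```python
import random
import re


def substitute_words(text, bilingual_vocab, replacement_ratio, rng):
    """Swap a random share of vocabulary-covered words for their translations.

    Assumption: the ratio is applied to words that have an entry in
    bilingual_vocab, not to all words in the text.
    """
    if not 0.0 <= replacement_ratio <= 1.0:
        raise ValueError("replacement_ratio must be in [0, 1]")
    # Split into alternating word / non-word runs so that joining the
    # pieces reproduces the original spacing and punctuation.
    tokens = re.findall(r"\w+|\W+", text)
    # Positions of tokens that have a word-level translation available.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in bilingual_vocab]
    n_swap = round(replacement_ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        tokens[i] = bilingual_vocab[tokens[i].lower()]
    return "".join(tokens)


def apply_to_corpus(documents, bilingual_vocab, corpus_fraction,
                    replacement_ratio, seed=0):
    """Apply substitution to a random portion of the documents.

    corpus_fraction is a hypothetical parameter standing in for the
    "portion of the high-resource corpus" mentioned in the abstract.
    """
    rng = random.Random(seed)
    result = []
    for doc in documents:
        if rng.random() < corpus_fraction:
            result.append(substitute_words(doc, bilingual_vocab, replacement_ratio, rng))
        else:
            result.append(doc)  # left untouched
    return result


# Toy English -> Spanish vocabulary, for illustration only.
toy_vocab = {"cat": "gato", "water": "agua"}
mixed = apply_to_corpus(["The cat drinks water."], toy_vocab,
                        corpus_fraction=1.0, replacement_ratio=1.0)
```

With both parameters set to 1.0, every covered word in every document is replaced; lower values leave part of the corpus, and part of each selected document, in the original English. A seeded `random.Random` instance keeps the augmented corpus reproducible across runs.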