Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus spanning a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset by translating English text with an off-the-shelf NMT model into a pool of six target languages, and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages on multiple tasks from the MTEB benchmark, evaluated on XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingual parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning the mE5 model on a small multi-way parallel dataset significantly improves bitext mining compared to finetuning on one without multi-way parallelism, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
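To make the training objective concrete, below is a minimal sketch (not the authors' released code) of a multi-way contrastive alignment loss: every translation of the same source sentence is treated as a positive, and all sentences from other groups in the batch act as negatives, in an InfoNCE-style formulation. The choice of xlm-roberta-base, mean pooling, and the temperature value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of multi-way contrastive alignment over parallel sentence groups.
# Assumes the HuggingFace `transformers` and `torch` libraries are installed.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"   # illustrative; the paper also evaluates multilingual BERT
TEMPERATURE = 0.05                # assumed value, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def encode(sentences):
    """Encode a list of sentences into unit-normalized sentence embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch)
    return F.normalize(mean_pool(out.last_hidden_state, batch["attention_mask"]), dim=-1)

def multiway_contrastive_loss(groups):
    """`groups` is a list of multi-way parallel tuples: one English sentence plus
    its translations into the target languages. Embeddings within a group attract
    each other; embeddings across groups repel (in-batch negatives)."""
    sentences, group_ids = [], []
    for gid, group in enumerate(groups):
        sentences.extend(group)
        group_ids.extend([gid] * len(group))
    emb = encode(sentences)                          # (N, d), unit-normalized
    gid = torch.tensor(group_ids)
    sim = emb @ emb.T / TEMPERATURE                  # (N, N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                # exclude self-similarity
    pos_mask = (gid.unsqueeze(0) == gid.unsqueeze(1)) & ~torch.eye(len(gid), dtype=torch.bool)
    log_prob = F.log_softmax(sim, dim=1)
    # average log-likelihood of all positives (all same-group translations) per anchor
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

# Toy usage with two hypothetical multi-way parallel groups:
groups = [
    ["The weather is nice today.", "Il fait beau aujourd'hui.", "今天天气很好。"],
    ["I am reading a book.", "Je lis un livre.", "我在看书。"],
]
print(multiway_contrastive_loss(groups))
```

In an actual training loop, this loss would be backpropagated through the encoder over batches of multi-way parallel groups; the key design point illustrated here is that multi-way parallelism gives each anchor several cross-lingual positives per batch, rather than the single positive available with English-centric (En-X) bilingual pairs.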