Multilingual information retrieval (IR) is challenging because annotated training data is costly to obtain in many languages. We present an effective method to train multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the cross-lingual transfer ability of pretrained multilingual language models. We design a semantic contrastive loss to align the representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss that uses parallel sentence pairs to remove language-specific information from the sentence representations of non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model significantly improves retrieval performance over prior work while requiring much less computation. We also demonstrate the value of our model in a practical setting where parallel corpora are available for only a few languages and remain scarce for many other low-resource languages. Our model works well even with a small number of parallel sentences, and it can be used as an add-on module with any backbone and for other tasks.
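To make the idea of aligning parallel sentences concrete, the following is a minimal sketch of a contrastive loss over parallel sentence pairs. The abstract does not give the formula, so this uses the common InfoNCE form with in-batch negatives as an assumption; it is not necessarily the loss the authors use. The function names, the `temperature` value, and the random stand-in embeddings are all hypothetical, and the language contrastive loss is not sketched here because the abstract does not specify it.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def semantic_contrastive_loss(src, tgt, temperature=0.05):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    src[i] and tgt[i] are embeddings of a parallel sentence pair, e.g. an
    English sentence and its translation. Every other tgt[j] in the batch
    serves as a negative for src[i].
    """
    src = l2_normalize(src)
    tgt = l2_normalize(tgt)
    logits = src @ tgt.T / temperature              # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (its own translation).
    return float(-np.mean(np.diag(log_probs)))


# Toy usage: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
english = rng.normal(size=(8, 16))
translated = english + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
aligned_loss = semantic_contrastive_loss(english, translated)
shuffled_loss = semantic_contrastive_loss(english, translated[::-1])
```

In this toy setup the loss is lower when each row is paired with its own near-copy than when the pairing is reversed, which is the behavior a training signal for cross-lingual alignment needs. In a real system the embeddings would come from the multilingual encoder, and this term would be added to the English IR training objective.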