Multilingual information retrieval (IR) is challenging because annotated training data is costly to obtain in many languages. We present an effective method for training multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the cross-lingual transfer ability of pretrained multilingual language models. We design a semantic contrastive loss to align the representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss that leverages parallel sentence pairs to remove language-specific information from the sentence representations of non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model significantly improves retrieval performance over prior work while requiring much less computational effort. We also demonstrate the value of our model in a practical setting where parallel corpora are available for only a few languages and remain lacking for many other low-resource languages. Our model works well even with a small number of parallel sentences, and it can be used as an add-on module with any backbone and for other tasks.
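To make the idea of a semantic contrastive loss over parallel sentences concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-batch InfoNCE formulation with cosine similarity and a temperature of 0.05; the abstract specifies none of these. The function names and the toy two-dimensional embeddings are hypothetical, and the language contrastive loss is not sketched because the abstract does not describe its form.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_contrastive_loss(src, tgt, temperature=0.05):
    """In-batch InfoNCE loss (an assumed formulation, not taken from the paper).

    src[i] and tgt[i] are embeddings of a parallel sentence pair; every other
    tgt[j] in the batch serves as a negative for src[i].
    """
    assert len(src) == len(tgt) and len(src) > 0
    total = 0.0
    for i, anchor in enumerate(src):
        logits = [cosine(anchor, t) / temperature for t in tgt]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true translation.
        total += log_denom - logits[i]
    return total / len(src)


# Toy embeddings (hypothetical): two English sentences and their translations.
english = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # translations close to their sources
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # translations swapped

loss_aligned = semantic_contrastive_loss(english, aligned)
loss_misaligned = semantic_contrastive_loss(english, misaligned)
```

In this toy batch the loss is close to zero when each translation lies nearest its own source sentence and much larger when the pairs are swapped, which is the pressure that pulls parallel sentences together in the embedding space.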