Neural ranking models have made significant progress thanks to transformer-based pre-trained language models. More recently, the advent of multilingual pre-trained language models has provided strong support for designing neural cross-lingual retrieval models. However, because pre-training data is unbalanced across languages, multilingual language models show a performance gap between high-resource and low-resource languages in many downstream tasks. Cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes training cross-lingual retrieval models more challenging. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high-resource to low-resource languages, OPTICAL formulates the cross-lingual token alignment task as an optimal transport problem in order to learn from a well-trained monolingual retrieval model. By separating cross-lingual knowledge from knowledge of query-document matching, OPTICAL needs only bitext data for distillation training, which is more feasible to obtain for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines, including neural machine translation, on low-resource languages.
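To make the idea of casting token alignment as an optimal transport problem more concrete, the following is a minimal sketch, not the OPTICAL implementation. It assumes a cosine-distance cost between token embeddings, uniform mass on every token, and an entropy-regularized Sinkhorn solver; none of these choices is stated in the abstract. The function name `sinkhorn_alignment` and the parameters `reg` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_alignment(src_emb, tgt_emb, reg=0.1, n_iters=200):
    """Return a soft alignment (transport plan) between two token sequences.

    src_emb: (n, d) array of token embeddings for one side of a bitext pair.
    tgt_emb: (m, d) array of token embeddings for the other side.
    Assumptions (not from the paper): cosine-distance cost, uniform token
    mass, entropy-regularized Sinkhorn iterations.
    """
    # Normalize rows so the dot product is the cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cost = 1.0 - src @ tgt.T  # shape (n, m), values in [0, 2]

    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over source tokens
    b = np.full(m, 1.0 / m)  # uniform mass over target tokens

    # Gibbs kernel and alternating scaling updates (Sinkhorn iterations).
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)

    plan = u[:, None] * kernel * v[None, :]
    return plan, cost


# Toy bitext pair: the target tokens are a reordering of the source tokens.
src_tokens = np.eye(3)
tgt_tokens = src_tokens[[2, 0, 1]]
plan, cost = sinkhorn_alignment(src_tokens, tgt_tokens)

# Total transport cost; in a distillation setup, a quantity like this could
# serve as an alignment loss between student and teacher token representations.
alignment_loss = float((plan * cost).sum())
print(plan.argmax(axis=1), alignment_loss)
```

In this toy case the plan concentrates its mass on the matching token pairs, so the transport cost is close to zero. A real setup would take the embeddings from the student and teacher encoders rather than from one-hot vectors.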