Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that holds across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks a sentence vector and predicts it via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. The code and model will be made available.
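To make the masked sentence prediction task concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It holds out one sentence vector from a document, predicts it from the remaining vectors, and scores the prediction against the true vector and sampled negatives with an InfoNCE-style contrastive loss. Several pieces are assumptions made for illustration: the function names, the temperature value, the use of cosine similarity, and above all `toy_document_encoder`, which simply averages the context vectors as a stand-in for the learned document encoder. The abstract does not specify the hierarchical structure of the loss, so only a single-level contrastive loss is shown.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mask_sentence(sentence_vectors, position):
    """Split a document into the held-out target vector and the visible context."""
    target = sentence_vectors[position]
    context = sentence_vectors[:position] + sentence_vectors[position + 1:]
    return target, context


def toy_document_encoder(context):
    """Hypothetical placeholder: average the context vectors (NOT a learned model)."""
    dim = len(context[0])
    return [sum(v[i] for v in context) / len(context) for i in range(dim)]


def contrastive_loss(predicted, target, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax score of the true sentence vector."""
    p = normalize(predicted)
    candidates = [normalize(target)] + [normalize(n) for n in negatives]
    logits = [dot(p, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy data: random vectors stand in for sentence-encoder outputs (an assumption).
random.seed(0)
document = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
negatives = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]

target, context = mask_sentence(document, position=2)
predicted = toy_document_encoder(context)
print(contrastive_loss(predicted, target, negatives))
```

The loss is small when the predicted vector is closer to the true sentence vector than to any negative, and grows as a negative becomes the better match. In a real model, the gradient of this loss would update both encoders, which the sketch leaves out.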