Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Recently multi-lingual pre-trained language models (PLM) such as mBERT and XLM-R have achieved impressive strides in cross-lingual dense retrieval. Despite its successes, they are general-purpose PLM while the multilingual PLM tailored for cross-lingual retrieval is still unexplored. Motivated by an observation that the sentences in parallel documents are approximately in the same order, which is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called masked sentence model (MSM), which consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document. The document encoder is shared for all languages to model the universal sequential sentence relation across languages. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and model will be available.

翻译：近期，多语言预训练语言模型（如mBERT和XLM-R）在跨语言密集检索领域取得了显著进展。尽管成果斐然，但这些模型属于通用型预训练语言模型，而针对跨语言检索专门优化的多语言预训练语言模型仍处于探索阶段。基于平行文档中句子顺序近似一致（这一特性跨语言普遍存在）的发现，我们提出对序列句子关系进行建模以促进跨语言表示学习。具体而言，我们提出名为掩码句子模型（MSM）的多语言预训练语言模型，该模型包含：生成句子表示的句子编码器，以及对文档中句子向量序列进行处理的文档编码器。文档编码器在所有语言间共享，用于建模跨语言的通用序列句子关系。为训练该模型，我们提出掩码句子预测任务，该任务通过带有采样负例的分层对比损失对句子向量进行掩码与预测。在四项跨语言检索任务上的综合实验表明，MSM显著优于现有先进预训练模型，证明了我们方法的有效性及更强的跨语言检索能力。代码与模型将公开提供。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

20篇「ACL2020」最新论文抢先看！看自然语言处理2020在研究什么？

专知会员服务

97+阅读 · 2020年4月10日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日