Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of many English human-labeled vision-language corpora. To overcome the limited availability of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, sentences translated by MT generally describe the corresponding visual content imperfectly. Improperly assuming that the pseudo-parallel data are correctly correlated makes the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample-pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments on two multilingual image-text datasets and one video-text dataset demonstrate the effectiveness and robustness of the proposed method. In addition, the proposed method extends well to cross-lingual image-text baselines and generalizes reasonably well to out-of-domain data.
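To make the idea of quantifying pair confidence with optimal transport more concrete, the following is a minimal sketch and not the DCOT implementation. It assumes a cosine-distance cost between image and text embeddings, uniform marginals, and a plain Sinkhorn solver; the function names `sinkhorn_plan` and `pair_confidence`, the regularization value `eps`, and the choice of the rescaled diagonal of the transport plan as a confidence score are all illustrative assumptions. The dual-view curriculum described in the abstract is not reproduced here.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan between uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass over images
    b = np.full(m, 1.0 / m)          # uniform mass over captions
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):         # alternate row / column scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def pair_confidence(img_emb, txt_emb, eps=0.1):
    """Hypothetical confidence score: mass the plan keeps on each given pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T         # cosine distance as transport cost
    plan = sinkhorn_plan(cost, eps)
    n = cost.shape[0]
    return n * np.diag(plan)         # rescaled so a fully trusted pair scores ~1

# Toy batch: captions 2 and 3 are swapped, mimicking noisy correspondence.
img_emb = np.eye(4)
txt_emb = np.eye(4)[[0, 1, 3, 2]]
conf = pair_confidence(img_emb, txt_emb)
print(conf)  # correctly paired samples score near 1, swapped ones near 0
```

In this toy batch the plan moves the mass of the swapped captions to the images they actually match, so the given (noisy) pairs receive almost no mass. Such scores could, for example, be used to down-weight suspicious pairs in a retrieval loss; how DCOT actually uses them is not specified in the abstract.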