Cross-lingual cross-modal retrieval, which aims to align vision with a target language (V-T) without using any annotated V-T data pairs, has attracted increasing attention. Current methods use machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multilingual, multimodal embedding space that aligns visual and target-language representations. However, the large heterogeneity gap between vision and text, together with the noise in target-language translations, makes it difficult to align their representations effectively. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language through cross-lingual transfer. This approach lets us exploit both the strengths of multilingual pre-trained models (e.g., mBERT) and the smaller gap within a single modality to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate the proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results demonstrate the effectiveness of the proposed method and its potential for large-scale retrieval.