Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search. It aims to break the barriers between modalities and languages simultaneously and to achieve image-text retrieval in multilingual scenarios with a single model. In recent years, excellent progress has been made through cross-lingual cross-modal pre-training; in particular, methods based on contrastive learning on large-scale data have significantly improved retrieval performance. However, these methods directly follow existing pre-training methods from the cross-lingual or cross-modal domain, leading to two inconsistency problems in CCR: (1) methods in the cross-lingual style suffer from intra-modal error propagation, resulting in inconsistent recall performance across languages over the whole dataset; (2) methods in the cross-modal style suffer from inter-modal optimization direction bias, resulting in inconsistent ranks across languages within each instance, which Recall@K cannot reflect. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-training data, achieving a new state of the art.
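To illustrate why a rank-consistency metric such as MRV can expose what Recall@K hides, here is a minimal sketch. The abstract does not state the exact formula, so this sketch assumes MRV is the population variance of the ground-truth ranks across languages for each instance, averaged over all instances; the function names `mean_rank_variance` and `recall_at_k` and the toy rank values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def mean_rank_variance(ranks_per_instance):
    """Assumed MRV: average, over instances, of the variance of ranks across languages.

    ranks_per_instance[i][k] is the rank (1 = best) of the correct match
    for instance i when the query is written in language k.
    """
    if not ranks_per_instance:
        raise ValueError("need at least one instance")
    per_instance = [pvariance(ranks) for ranks in ranks_per_instance]
    return mean(per_instance)


def recall_at_k(ranks_per_instance, k):
    """Fraction of (instance, language) queries whose correct match is in the top k."""
    flat = [r for ranks in ranks_per_instance for r in ranks]
    return sum(r <= k for r in flat) / len(flat)


# Toy data (hypothetical): two instances, three languages each.
consistent = [[2, 2, 2], [5, 5, 5]]
inconsistent = [[1, 2, 3], [1, 5, 9]]

# Both rank tables give the same Recall@10, since every rank is within the top 10 ...
print(recall_at_k(consistent, 10), recall_at_k(inconsistent, 10))
# ... but only the second one has nonzero rank variance across languages.
print(mean_rank_variance(consistent), mean_rank_variance(inconsistent))
```

Under this assumed definition, a lower value means the same instance is ranked more uniformly regardless of the query language, which is the within-instance consistency the abstract says Recall@K cannot capture.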