Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. In real-world scenarios, however, massive multimodal data are harvested from the Internet and inevitably contain Partially Mismatched Pairs (PMPs). Such semantically irrelevant data can substantially harm cross-modal retrieval performance. Previous efforts mitigate this problem by estimating a soft correspondence that down-weights the contribution of PMPs. In this paper, we address the challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. Specifically, L2RM generates refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT, we first propose a self-supervised cost function that automatically learns from an explicit similarity-cost mapping relation. Second, we model a partial OT problem while restricting the transport among false positives to further boost the refined alignments. Extensive experiments on three benchmarks demonstrate that L2RM significantly improves the robustness of existing models against PMPs. The code is available at https://github.com/hhc1997/L2RM.
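The rematching idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several simplifying assumptions: a fixed cost of one minus cosine similarity stands in for the learned self-supervised cost function, balanced entropic OT solved with plain Sinkhorn iterations stands in for the partial OT formulation, and a large cost on the diagonal of flagged pairs stands in for the restriction on transport among false positives. The function names `sinkhorn_plan` and `rematch`, the `mismatched` flag array, and all hyperparameter values are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, epsilon=0.05, n_iters=200):
    """Entropic OT with uniform marginals via plain Sinkhorn iterations.

    Assumption: balanced OT, used here in place of the partial OT
    problem described in the abstract.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over images
    b = np.full(m, 1.0 / m)  # uniform mass over texts
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def rematch(img_emb, txt_emb, mismatched, epsilon=0.05):
    """Return a transport plan and, per image, the index of its rematched text.

    Assumption: cost = 1 - cosine similarity (a fixed cost, not a learned one).
    `mismatched[i]` flags the original pair (image i, text i) as a suspected
    false positive; its diagonal entry is given a prohibitive cost.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T
    idx = np.flatnonzero(mismatched)
    cost[idx, idx] = 1e3  # effectively forbid transport along flagged pairs
    plan = sinkhorn_plan(cost, epsilon=epsilon)
    return plan, plan.argmax(axis=1)
```

In this sketch, each row of the returned plan can be read as a soft target over texts for the corresponding image, and the row-wise argmax gives a hard rematch. A small `epsilon` pushes the plan toward a near one-to-one assignment, while a larger value spreads mass over several candidates.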