Cross-modal retrieval (CMR) aims to establish interaction between different modalities. Supervised CMR in particular is gaining traction because of its flexibility in learning semantic category discrimination. Previous supervised CMR methods achieve remarkable performance, but much of their success can be attributed to well-annotated data. Even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces multimodal samples to \emph{align incorrect semantics} and \emph{widen the heterogeneous gap}, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport costs. Second, to narrow the discrepancy in multimodal data, an OT-based relation alignment is proposed to infer semantic-level cross-modal matching. Both components leverage the inherent correlation among multimodal data to construct effective cost functions. Experiments on three widely used cross-modal retrieval datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
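To make the idea of OT-based label correction concrete, the following is a minimal sketch and not the UOT-RCL method itself. It swaps in balanced, entropy-regularized OT solved with Sinkhorn iterations in place of the partial OT used in the paper, and it replaces the cross-modal consistent cost with a plain negative log-probability cost. The function names `sinkhorn` and `correct_labels`, the uniform class marginal, and the values of `reg` and `n_iters` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between two discrete distributions.

    Generic Sinkhorn iterations; NOT the partial-OT solver from the paper.
    """
    kernel = np.exp(-cost / reg)          # Gibbs kernel of the cost matrix
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (kernel.T @ u)  # rescale to match column marginal
        u = row_marginal / (kernel @ v)    # rescale to match row marginal
    return u[:, None] * kernel * v[None, :]


def correct_labels(probs, reg=0.1):
    """Reassign labels by transporting samples to classes.

    probs: (n_samples, n_classes) predicted class probabilities.
    Assumption: classes are balanced, so the class marginal is uniform.
    Assumption: cost is -log(probability), a stand-in for the paper's
    cross-modal consistent cost function.
    """
    n_samples, n_classes = probs.shape
    cost = -np.log(probs + 1e-8)
    plan = sinkhorn(
        cost,
        np.full(n_samples, 1.0 / n_samples),
        np.full(n_classes, 1.0 / n_classes),
        reg=reg,
    )
    # Each row of the plan spreads one sample's mass over the classes;
    # the class receiving the most mass is taken as the corrected label.
    return plan.argmax(axis=1), plan
```

Calling `correct_labels(probs)` on a matrix of predicted class probabilities returns one corrected label per sample together with the transport plan. A partial-OT variant would transport only a fraction of the total mass, which leaves low-confidence samples unassigned rather than forcing every sample onto a class.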