Image-text matching, which is fundamental to understanding the latent correspondence between visual and textual modalities, has attracted increasing attention from academia and industry. However, most existing methods implicitly assume that the training pairs are well aligned and ignore the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), which inevitably leads to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems, especially under high noise: excessive memorization (overfitting) of NC and unreliable correction of NC. To address these two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which combines a novel Active Complementary Loss (ACL) with an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, and its robustness against NC is demonstrated both theoretically and experimentally. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of CRCL against both synthetic and real-world noisy correspondences.
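To make the two ingredients concrete, the following is a minimal sketch under stated assumptions, not the formulation used in CRCL. It pairs a label-weighted cross-entropy term (standing in for an "active" loss) with a term that pushes probability away from non-annotated candidates (standing in for a "complementary" loss), and it updates a soft correspondence label with an exponential moving average (standing in for "momentum correction"). All function names, the exact loss forms, and the value `beta=0.8` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn similarity scores for one query into matching probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def active_loss(probs, pos, label):
    """Illustrative active term: cross-entropy on the annotated pair,
    scaled by the current soft correspondence label in [0, 1]."""
    return -label * math.log(max(probs[pos], 1e-12))


def complementary_loss(probs, pos):
    """Illustrative complementary term: penalize probability mass assigned
    to candidates other than the annotated one."""
    return -sum(
        math.log(max(1.0 - p, 1e-12))
        for j, p in enumerate(probs)
        if j != pos
    )


def momentum_correct(prev_label, predicted, beta=0.8):
    """Illustrative momentum correction: blend the previous soft label with
    the model's current prediction instead of overwriting it."""
    return beta * prev_label + (1.0 - beta) * predicted


# One query with three candidates; index 0 is the annotated (possibly noisy) match.
scores = [-1.0, 2.0, 0.5]
probs = softmax(scores)

label = 1.0  # start by trusting the annotation
for _ in range(5):  # repeated refinement with a fixed prediction
    label = momentum_correct(label, probs[0])

loss = active_loss(probs, 0, label) + complementary_loss(probs, 0)
```

In this toy setting the model assigns low probability to the annotated pair, so the soft label decays gradually toward that prediction across updates rather than jumping in a single step, which is the intuition behind using momentum to limit error accumulation.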