Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods aimed at quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network, an effective solution that improves performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT leverages the Cross-modal Mover's Distance to precisely compute the similarity between fine-grained visual and textual fragments, using Sinkhorn iterations for efficient approximation. To further alleviate the issue of redundant alignments, we seamlessly integrate partial matching into OMIT, leveraging local-to-global similarities to eliminate the interference of irrelevant fragments. We conduct extensive evaluations of OMIT on two benchmark image-text retrieval datasets, Flickr30K and MS-COCO. The superior performance achieved on both datasets demonstrates OMIT's effectiveness in cross-modal matching. Furthermore, through comprehensive visualization analysis, we elucidate OMIT's inherent tendency towards focal matching, shedding light on its efficacy. Our code is publicly available at https://github.com/ppanzx/OMIT.
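To make the Sinkhorn-based fragment matching concrete, the following is a minimal sketch of entropic Optimal Transport between a set of region embeddings and a set of word embeddings. It is not the authors' implementation (see the repository for that); the function name `sinkhorn_similarity`, the cosine-similarity cost, the uniform marginals, and the hyperparameters `eps` and `n_iters` are all illustrative assumptions.

```python
import torch

def sinkhorn_similarity(img_feats, txt_feats, eps=0.05, n_iters=50):
    """Illustrative entropic-OT similarity between fragment sets.

    img_feats: (n, d) region embeddings; txt_feats: (m, d) word embeddings.
    Cost is 1 - cosine similarity; marginals are assumed uniform. The score
    is the transport-weighted sum of fragment similarities (a stand-in for
    a Cross-modal Mover's Distance-style score).
    """
    # Cosine similarity between every region-word pair
    img = torch.nn.functional.normalize(img_feats, dim=-1)
    txt = torch.nn.functional.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()                      # (n, m)
    cost = 1.0 - sim

    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)           # uniform marginal over regions
    nu = torch.full((m,), 1.0 / m)           # uniform marginal over words

    K = torch.exp(-cost / eps)               # Gibbs kernel
    u = torch.ones(n)
    for _ in range(n_iters):                 # Sinkhorn scaling iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan, shape (n, m)

    return (T * sim).sum()                   # global image-text similarity
```

Under these assumptions, the transport plan concentrates mass on well-matched region-word pairs, so the aggregated score emphasizes relevant fragments while the entropic regularization keeps the iterations cheap.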