The image-text retrieval task aims to retrieve relevant information given an image or text query. The main challenge is to unify multimodal representations and distinguish fine-grained differences across modalities, so that similar content can be found and irrelevant content filtered out. However, existing methods mainly focus on unified semantic representation and concept alignment across modalities, while fine-grained cross-modal differences have rarely been studied, making it difficult to solve the information asymmetry problem. In this paper, we propose a novel asymmetry-sensitive contrastive learning method. By generating corresponding positive and negative samples for different asymmetry types, our method simultaneously ensures fine-grained semantic differentiation and unified semantic representation across modalities. Additionally, we propose a hierarchical cross-modal fusion method that integrates global and local-level features through a multimodal attention mechanism to achieve concept alignment. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness and superiority of the proposed method.
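The abstract does not state the loss function, so the following is only a minimal sketch of how generated positive and negative samples might enter an InfoNCE-style contrastive objective. It is not the authors' formulation. The function name `asymmetry_aware_info_nce`, the use of one generated positive and one generated negative caption per image, the equal weighting of the two positives, and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def asymmetry_aware_info_nce(img, txt, txt_gen_pos, txt_gen_neg, temperature=0.07):
    """Hypothetical image-to-text contrastive loss with generated samples.

    img:         (B, D) image embeddings
    txt:         (B, D) embeddings of the original paired captions
    txt_gen_pos: (B, D) embeddings of generated captions that should still match
    txt_gen_neg: (B, D) embeddings of generated captions that should NOT match
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    txt_gen_pos, txt_gen_neg = l2_normalize(txt_gen_pos), l2_normalize(txt_gen_neg)

    # In-batch similarities: entry (i, j) compares image i with caption j.
    sim_batch = img @ txt.T / temperature                        # (B, B)
    # Row-wise similarities between each image and its own generated captions.
    sim_pos = np.sum(img * txt_gen_pos, axis=1) / temperature    # (B,)
    sim_neg = np.sum(img * txt_gen_neg, axis=1) / temperature    # (B,)

    # Candidates for each image: all in-batch captions plus its generated pair.
    logits = np.concatenate(
        [sim_batch, sim_pos[:, None], sim_neg[:, None]], axis=1
    )                                                            # (B, B + 2)

    # Numerically stable log-sum-exp over the candidates.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))

    # Average the log-probabilities of the original and the generated positive.
    log_p_orig = np.diag(sim_batch) - log_denom
    log_p_gen = sim_pos - log_denom
    return float(np.mean(-0.5 * (log_p_orig + log_p_gen)))
```

In this sketch, a generated negative that lies close to the image embedding enlarges the denominator and raises the loss, which is one plausible way to penalize a model that ignores fine-grained differences. A symmetric text-to-image term would be built the same way with the roles of the modalities swapped.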