The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across modalities. Pre-trained networks such as Contrastive Language-Image Pre-training (CLIP) have attracted significant attention for their strong performance on a range of cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on relations between image regions: it is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations using a local encoder. VITR comprises two key components. First, it extends ViT-based cross-modal networks so that they can extract and reason about the region relations present in images. Second, it incorporates a fusion module that combines the reasoning results with global knowledge to predict similarity scores between images and descriptions. VITR was evaluated on relation-focused cross-modal information retrieval tasks. Results on the RefCOCOg, CLEVR, and Flickr30K datasets show that VITR consistently outperforms state-of-the-art networks in both image-to-text and text-to-image retrieval.
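To make the idea of fusing local relation reasoning with global knowledge concrete, the following is a minimal sketch and not the VITR architecture itself. It assumes that image regions, the whole image, and the description are already embedded as plain vectors of equal length. The pairwise-sum relation feature, the softmax weighting, the `alpha` mixing weight, and all function names are hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_relation_score(region_feats, text_feat):
    """Hypothetical local score over region pairs.

    Assumption: each unordered region pair is represented by the element-wise
    sum of its two region vectors, and pairs are weighted by a softmax over
    their similarity to the text.
    """
    pair_feats = [
        [a + b for a, b in zip(region_feats[i], region_feats[j])]
        for i in range(len(region_feats))
        for j in range(i + 1, len(region_feats))
    ]
    if not pair_feats:
        return 0.0  # fewer than two regions: no relation to score
    sims = [cosine(p, text_feat) for p in pair_feats]
    weights = softmax(sims)
    return sum(w * s for w, s in zip(weights, sims))


def fused_similarity(global_img, region_feats, text_feat, alpha=0.5):
    """Mix a global image-text score with the local relation score.

    Assumption: a fixed convex combination stands in for a learned fusion module.
    """
    global_score = cosine(global_img, text_feat)
    local_score = local_relation_score(region_feats, text_feat)
    return alpha * global_score + (1 - alpha) * local_score


if __name__ == "__main__":
    image = [0.9, 0.1, 0.0]
    regions = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
    text = [1.0, 0.1, 0.0]
    print(fused_similarity(image, regions, text))
```

In a retrieval setting, a score of this kind would be computed for every image-description pair and used to rank candidates in either direction. A fixed `alpha` is used here only to keep the sketch short; the abstract describes a fusion module rather than a constant weight.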