Relation-focused cross-modal information retrieval retrieves information based on the relations expressed in user queries, and it is particularly important for information retrieval applications and next-generation search engines. Although pre-trained networks such as Contrastive Language-Image Pre-training (CLIP) have achieved state-of-the-art performance on cross-modal learning tasks, the Vision Transformer (ViT) used in these networks has limited ability to attend to relations between image regions. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations using a Local encoder. VITR comprises two main components: (1) a component that extends ViT-based cross-modal networks to extract and reason about region relations in images; and (2) a component that aggregates the reasoned results with global knowledge to predict similarity scores between images and descriptions. Experiments applied the proposed network to relation-focused cross-modal information retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results show that VITR outperforms other state-of-the-art networks, including CLIP, VSE$\infty$, and VSRN++, on both image-to-text and text-to-image retrieval tasks.
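The idea of combining a global image-description match with a region-level score can be illustrated with a minimal sketch. It is not the VITR architecture: the abstract does not specify how region relations are reasoned about or how the results are aggregated, so everything here is an assumption made for illustration. The function names (`cosine`, `local_score`, `fused_similarity`, `rank_captions`), the max-over-regions local score, and the weighted sum controlled by the hypothetical parameter `alpha` are all stand-ins, and the feature vectors are toy values rather than ViT or Local encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_score(region_feats, word_feats):
    """Assumed local score: each word picks its best-matching region,
    and the per-word maxima are averaged."""
    if not region_feats or not word_feats:
        return 0.0
    best = [max(cosine(w, r) for r in region_feats) for w in word_feats]
    return sum(best) / len(best)


def fused_similarity(img_global, txt_global, region_feats, word_feats, alpha=0.5):
    """Assumed aggregation: a weighted sum of the global similarity and the
    local score. `alpha` is a hypothetical mixing weight."""
    global_sim = cosine(img_global, txt_global)
    local_sim = local_score(region_feats, word_feats)
    return alpha * global_sim + (1.0 - alpha) * local_sim


def rank_captions(image, captions, alpha=0.5):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [
        fused_similarity(image["global"], c["global"], image["regions"], c["words"], alpha)
        for c in captions
    ]
    return sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)


# Toy example with 2-dimensional features.
image = {"global": [1.0, 0.0], "regions": [[1.0, 0.0], [0.0, 1.0]]}
captions = [
    {"global": [0.0, 1.0], "words": [[1.0, 1.0]]},  # weak global match
    {"global": [1.0, 0.0], "words": [[1.0, 0.0]]},  # strong global and local match
]
ranking = rank_captions(image, captions)
```

In this toy setup the second caption ranks first because it matches the image both globally and at the region level. A learned aggregation would replace the fixed weighted sum.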