Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In this paper, we treat each triplet as a graph node and identify two new relations within triplets. First, we introduce text-bridged image alignment, in which the query text serves as a bridge between the query image and the target image; we propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Second, we explore complementary text reasoning, viewing CIR as a form of cross-modal retrieval in which the two images compose to reason about the complementary text; to integrate this perspective effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on the CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.
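To make the text-bridged image alignment idea concrete, here is a minimal, hypothetical sketch: the query text attends to query-image features, and the resulting text-conditioned representation then attends to target-image features, so the text acts as a bridge between the two images. This is an illustrative single-head scaled dot-product cross-attention in numpy, not the paper's actual hinge-based architecture; all function names and dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Single-head scaled dot-product attention: q attends to (k, v).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Lq, Lk) attention logits
    return softmax(scores) @ v             # (Lq, d) attended features

def text_bridged_alignment(text, query_img, target_img):
    # Hop 1: text tokens attend to query-image features.
    bridged = cross_attention(text, query_img, query_img)
    # Hop 2: the text-conditioned tokens attend to target-image features,
    # linking query image and target image through the text bridge.
    return cross_attention(bridged, target_img, target_img)

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(8, 64))     # 8 text tokens, dim 64 (illustrative)
query_tokens = rng.normal(size=(16, 64))   # 16 query-image patch features
target_tokens = rng.normal(size=(16, 64))  # 16 target-image patch features
out = text_bridged_alignment(text_tokens, query_tokens, target_tokens)
print(out.shape)  # (8, 64)
```

A training objective could then align `out` with the target-image features, which is the intuition behind treating the text as a bridge rather than only matching the fused query against the target.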