Composed Image Retrieval (CIR) searches for target images given a query that pairs an image with a text. Current methods treat this as a query-target matching problem, but we argue that CIR triplets (query image, query text, target image) contain additional associations beyond this primary relation. In this paper, we identify two new relations within triplets by treating each triplet as a graph node. First, we introduce text-bridged image alignment, in which the query text serves as a bridge between the query image and the target image, and we propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Second, we explore complementary text reasoning, which views CIR as a form of cross-modal retrieval where the two images are composed to reason about the complementary text. To integrate these perspectives, we design a twin attention-based compositor. Combining these complementary associations with the explicit relation between the query pair and the target image yields a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), builds on these insights. We evaluate CaLa on the CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.
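As a reading aid, the three relations named in the abstract can be sketched as one combined training objective. This is only an illustrative form under stated assumptions, not the formulation used in the paper: the symbols $I_q$, $T_q$, $I_t$ (query image, query text, target image), the individual loss names, and the weights $\lambda_1$, $\lambda_2$ are all hypothetical notation introduced here.

```latex
% Illustrative sketch only; all symbols and weights are assumed notation.
% L_match : primary relation, composed query (I_q, T_q) matched to target I_t
% L_align : text-bridged image alignment, T_q bridging I_q and I_t
% L_text  : complementary text reasoning, (I_q, I_t) composed to retrieve T_q
\mathcal{L}
  = \mathcal{L}_{\mathrm{match}}\bigl((I_q, T_q),\, I_t\bigr)
  + \lambda_1\, \mathcal{L}_{\mathrm{align}}\bigl(I_q,\, T_q,\, I_t\bigr)
  + \lambda_2\, \mathcal{L}_{\mathrm{text}}\bigl((I_q, I_t),\, T_q\bigr)
```

Under this reading, the first term is the matching constraint that existing methods already use, and the two additional terms are the complementary associations that the abstract describes as further constraints on the same triplet.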