Composed image retrieval, the task of searching for a target image using a reference image and a complementary text as the query, has advanced significantly thanks to progress in cross-modal modeling. Whereas general image-text retrieval involves only one alignment relation (image-text), we argue that composed image retrieval involves two types of relations. The explicit relation, (reference image & complementary text)-target image, is the one commonly exploited by existing methods. Beyond this intuitive relation, our observations in practice have uncovered another implicit yet crucial relation, (reference image & target image)-complementary text: the complementary text can be inferred by studying the relation between the target image and the reference image. Regrettably, existing methods largely focus on leveraging the explicit relation to learn their networks, while overlooking the implicit relation. In response to this weakness, we propose a new framework for composed image retrieval, termed dual relation alignment, which integrates both explicit and implicit relations to fully exploit the correlations within each triplet. Specifically, we first design a vision compositor to fuse the reference image and the target image; the resulting representation then plays two roles: (1) a counterpart for semantic alignment with the complementary text, and (2) a compensation for the complementary text to boost explicit relation modeling, thereby implanting the implicit relation into the alignment learning. We evaluate our method through extensive experiments on two popular datasets, CIRR and FashionIQ. The results confirm the effectiveness of our dual-relation learning in substantially enhancing composed image retrieval performance.
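To make the two relations concrete, the following is a minimal training-time sketch, not the authors' implementation. It assumes pre-extracted feature vectors, a hypothetical one-layer `vision_compositor`, simple additive fusion for the query, and an InfoNCE-style contrastive loss for both alignments; none of these choices is specified in the abstract, and the abstract does not say how the target-dependent representation is handled at inference time.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(queries, keys, temperature=0.07):
    """Batch contrastive loss: row i of `queries` should match row i of `keys`."""
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def vision_compositor(ref, tgt, w):
    """ASSUMPTION: fuse reference and target features with one linear layer + tanh."""
    return np.tanh(np.concatenate([ref, tgt], axis=1) @ w)

def dual_relation_loss(ref, text, tgt, w, temperature=0.07):
    """Return (explicit_loss, implicit_loss) for one batch of triplets."""
    fused = vision_compositor(ref, tgt, w)

    # Implicit relation: (reference image & target image) <-> complementary text.
    implicit = info_nce(fused, text, temperature)

    # Explicit relation: (reference image & complementary text) <-> target image.
    # ASSUMPTION: the fused representation compensates the text by addition.
    compensated_text = text + fused
    query = ref + compensated_text
    explicit = info_nce(query, tgt, temperature)

    return explicit, implicit

# Toy usage with random features (batch of 4 triplets, feature dimension 8).
rng = np.random.default_rng(0)
ref, text, tgt = (rng.normal(size=(4, 8)) for _ in range(3))
w = rng.normal(size=(16, 8))
explicit_loss, implicit_loss = dual_relation_loss(ref, text, tgt, w)
total_loss = explicit_loss + implicit_loss  # ASSUMPTION: equal weighting
```

In this sketch the two losses share the compositor output, which is one simple way for the implicit relation to influence the explicit alignment; the actual fusion modules and loss weighting would follow the paper's method section.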