Composed image retrieval, the task of searching for a target image using a reference image and a complementary text as the query, has advanced significantly thanks to progress in cross-modal modeling. Unlike general image-text retrieval, which involves only one alignment relation (image-text), we argue that composed image retrieval involves two types of relations. The explicit relation, which aligns the reference image and complementary text with the target image, is commonly exploited by existing methods. Beyond this intuitive relation, our observations in practice have uncovered an implicit yet crucial relation, which aligns the reference image and target image with the complementary text: we found that the complementary text can be inferred by studying the relation between the target image and the reference image. Regrettably, existing methods largely focus on the explicit relation when training their networks and overlook the implicit one. To address this weakness, we propose a new framework for composed image retrieval, termed dual relation alignment, which integrates both explicit and implicit relations to fully exploit the correlations within the triplets. Specifically, we first design a vision compositor to fuse the reference image and the target image; the resulting representation then serves two roles: (1) a counterpart for semantic alignment with the complementary text, and (2) compensation for the complementary text to boost explicit relation modeling, thereby implanting the implicit relation into alignment learning. We evaluate our method on two popular datasets, CIRR and FashionIQ, through extensive experiments. The results confirm the effectiveness of our dual-relation learning in substantially enhancing composed image retrieval performance.
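
The two relations described above can be illustrated with a minimal sketch of a combined training objective. This is not the authors' implementation: the abstract does not specify the architecture or the loss, so everything here is an assumption made for illustration. The names `vision_compositor`, `dual_relation_loss`, `w_fuse`, `w_query`, and `implicit_weight` are hypothetical, the compositor is reduced to a single linear layer over concatenated features, "compensation" is modeled as simple addition, and a standard InfoNCE contrastive loss stands in for whatever alignment loss the paper uses. The sketch covers training only, since the abstract does not describe how retrieval proceeds when the target image is unavailable.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.07):
    """Batch contrastive loss: the i-th query should match the i-th key."""
    q, k = l2_normalize(queries), l2_normalize(keys)
    logits = (q @ k.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def vision_compositor(ref, target, w_fuse):
    """Hypothetical compositor: fuse reference and target image features."""
    return np.tanh(np.concatenate([ref, target], axis=-1) @ w_fuse)


def dual_relation_loss(ref, text, target, w_fuse, w_query, implicit_weight=1.0):
    """Return (total, explicit, implicit) losses for one batch of triplets."""
    composed = vision_compositor(ref, target, w_fuse)

    # Role 1 (implicit relation): the fused image pair should align with the text.
    implicit = info_nce(composed, text)

    # Role 2 (assumed form of compensation): add the fused representation to the
    # text features before building the query for the explicit relation.
    compensated_text = text + composed
    query = np.tanh(np.concatenate([ref, compensated_text], axis=-1) @ w_query)

    # Explicit relation: (reference image, text) query should align with the target.
    explicit = info_nce(query, target)

    return explicit + implicit_weight * implicit, explicit, implicit


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
batch, dim = 4, 8
ref, text, target = (rng.normal(size=(batch, dim)) for _ in range(3))
w_fuse = rng.normal(size=(2 * dim, dim)) * 0.1
w_query = rng.normal(size=(2 * dim, dim)) * 0.1
total, explicit, implicit = dual_relation_loss(ref, text, target, w_fuse, w_query)
print(f"total={total:.3f} explicit={explicit:.3f} implicit={implicit:.3f}")
```

In this sketch both losses share the fused representation, which is one simple way for the implicit relation to influence how the explicit query is learned; the relative weighting of the two terms is left as a free parameter.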