Composed image retrieval, the task of searching for a target image using a reference image and a complementary text as the query, has advanced significantly thanks to progress in cross-modal modeling. Unlike general image-text retrieval, which involves only one alignment relation (image-text), we argue that composed image retrieval involves two types of relations. The explicit relation holds between the pair (reference image, complementary text) and the target image, and is commonly exploited by existing methods. Beyond this intuitive relation, our observations in practice have uncovered another implicit yet crucial relation, between the pair (reference image, target image) and the complementary text: the complementary text can be inferred by studying the relation between the target image and the reference image. Existing methods largely focus on the explicit relation when training their networks and overlook the implicit one. To address this weakness, we propose a new framework for composed image retrieval, termed dual relation alignment, which integrates both explicit and implicit relations to fully exploit the correlations within each triplet. Specifically, we first design a vision compositor to fuse the reference image and the target image; the resulting representation then plays two roles: (1) a counterpart for semantic alignment with the complementary text and (2) compensation for the complementary text to boost explicit relation modeling, thereby implanting the implicit relation into alignment learning. We evaluate our method through extensive experiments on two popular datasets, CIRR and FashionIQ. The results confirm that dual-relation learning substantially improves composed image retrieval performance.
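The two alignment relations described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the loss, the compositor architecture, or how the compensation role is realized, so everything below is an assumption made for illustration. In particular, `info_nce` (a standard batch contrastive loss), `vision_compositor` (here a plain element-wise difference rather than a learned module), the additive query composition, and the `weight` parameter are all hypothetical stand-ins, and the compensation role (2) is left out entirely.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Batch contrastive loss in which queries[i] should match keys[i].

    Assumption: a generic InfoNCE-style loss, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def vision_compositor(ref, tgt):
    """Hypothetical stand-in for the vision compositor: element-wise difference.

    The paper's compositor is a designed module; this is only a placeholder.
    """
    return [t - r for r, t in zip(ref, tgt)]


def dual_relation_loss(ref_feats, text_feats, tgt_feats, weight=1.0):
    """Sum of an explicit and an implicit alignment term over a batch of triplets."""
    # Explicit relation: (reference image, complementary text) -> target image.
    # Assumption: the query is composed by simple addition.
    queries = [[r + x for r, x in zip(rv, xv)]
               for rv, xv in zip(ref_feats, text_feats)]
    explicit = info_nce(queries, tgt_feats)

    # Implicit relation: (reference image, target image) -> complementary text.
    fused = [vision_compositor(rv, tv) for rv, tv in zip(ref_feats, tgt_feats)]
    implicit = info_nce(fused, text_feats)

    return explicit + weight * implicit
```

Called on three equal-length lists of feature vectors (one each for reference images, complementary texts, and target images), `dual_relation_loss` returns a single scalar. On toy data where each target feature equals the reference feature plus the text feature, both terms are small for correctly paired triplets and grow when the texts are mismatched, which is the behavior the dual alignment is meant to encourage.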