Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
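The core mechanism, mapping the reference image to a pseudo-word token in CLIP's token embedding space and composing it with the relative caption, can be sketched with toy stand-ins. Everything below (encoder functions, the projection `phi`, embedding sizes, token vectors) is an illustrative assumption, not the actual iSEARLE or CLIP implementation:

```python
import numpy as np

DIM = 8  # toy embedding size; real CLIP uses 512 or 768 dimensions


def embed(seed):
    # Toy stand-in for a CLIP feature: a deterministic pseudo-random vector.
    return np.random.default_rng(seed).standard_normal(DIM)


# Hypothetical textual-inversion network phi: maps an image feature to a
# pseudo-word token embedding living in the text token-embedding space.
# Here it is just a fixed random linear projection for illustration.
W = np.random.default_rng(1).standard_normal((DIM, DIM)) / np.sqrt(DIM)


def phi(img_feat):
    return img_feat @ W


def text_encoder(token_embs):
    # Toy "text encoder": mean-pool token embeddings, then L2-normalize.
    v = np.mean(token_embs, axis=0)
    return v / np.linalg.norm(v)


# Query = reference image + relative caption. The caption's word embeddings
# are random toy vectors standing in for CLIP's learned word embeddings.
ref_feat = embed(42)
pseudo_word = phi(ref_feat)            # the pseudo-word token for the image
caption_tokens = [embed(7), embed(8)]  # e.g. tokens of "make it red"
query = text_encoder([pseudo_word] + caption_tokens)

# Retrieval: rank gallery images by cosine similarity to the composed query.
gallery = {
    name: embed(seed) / np.linalg.norm(embed(seed))
    for name, seed in [("img_a", 42), ("img_b", 5), ("img_c", 6)]
}
ranking = sorted(gallery, key=lambda n: -float(query @ gallery[n]))
```

The key design point this sketch mirrors is that `phi` outputs a vector in the same space as ordinary word embeddings, so the pseudo-word can be dropped into a caption and processed by the text encoder exactly like any other token, enabling zero-shot composition without labeled CIR triplets.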