In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets, each consisting of a query image, a text specification, and a target image. Labeling such triplets is expensive and hinders the broad applicability of CIR. In this work, we study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmarks CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval.
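To make the task setup concrete, the following is a minimal sketch of the retrieval step in CIR: a query image embedding and a text embedding are combined into one query vector, and gallery images are ranked by cosine similarity. It assumes the embeddings have already been computed by some image and text encoders. The function names (`compose_query`, `retrieve`) and the toy vectors are hypothetical, and the composition used here (a normalized sum) is only an illustrative placeholder, not the Pic2Word method, whose details the abstract does not give.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose_query(image_emb, text_emb):
    """Hypothetical placeholder composition: normalized sum of both embeddings.

    This is NOT Pic2Word; it only stands in for "combine image and text".
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return l2_normalize([a + b for a, b in zip(img, txt)])


def retrieve(query_emb, gallery_embs, k=3):
    """Return the top-k (index, cosine similarity) pairs from the gallery."""
    scored = []
    for idx, emb in enumerate(gallery_embs):
        g = l2_normalize(emb)
        score = sum(q * x for q, x in zip(query_emb, g))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-d embeddings (made up for illustration only).
image_emb = [1.0, 0.0, 0.0]   # content of the query image
text_emb = [0.0, 1.0, 0.0]    # content of the modifying text
gallery = [
    [1.0, 0.0, 0.0],  # matches the image only
    [0.0, 1.0, 0.0],  # matches the text only
    [1.0, 1.0, 0.0],  # matches both: the intended target
    [0.0, 0.0, 1.0],  # unrelated
]

query = compose_query(image_emb, text_emb)
results = retrieve(query, gallery, k=2)
print(results)
```

In this toy gallery, the item that reflects both the image and the text ranks first, which is the behavior a CIR model is meant to produce. In the zero-shot setting described above, the composition would have to be learned without any labeled (query image, text, target image) triplets.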