Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference image and that integrate the modifications expressed by the caption. Since recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle this task. In the first stage, we perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. In the second stage, we train a Combiner network that learns to combine the image and text features, integrating the bimodal information and producing the combined features used for retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir.
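The first-stage idea described above (query = element-wise sum of image and text features, trained with a contrastive objective) can be illustrated with a minimal sketch. This is not the CLIP4Cir code: the toy 3-dimensional vectors stand in for CLIP features, the function names (`sum_query`, `contrastive_loss`, `retrieve`) are hypothetical, the temperature of 0.1 is an arbitrary choice, and the in-batch softmax (InfoNCE-style) loss is an assumption, since the abstract only states that contrastive learning is used. The Combiner network is not sketched because its architecture is not described here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sum_query(image_feat, text_feat):
    """Build the query as the element-wise sum of image and text features."""
    return l2_normalize([i + t for i, t in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.1):
    """Assumed in-batch contrastive loss: queries[i] should match targets[i];
    every other target in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def retrieve(query, index_feats):
    """Return index positions sorted by decreasing similarity to the query."""
    scores = [dot(query, feat) for feat in index_feats]
    return sorted(range(len(index_feats)), key=lambda k: scores[k], reverse=True)


# Toy stand-ins for CLIP features (assumed values, for illustration only).
reference_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
relative_captions = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
target_images = [l2_normalize([1.0, 0.0, 1.0]), l2_normalize([0.0, 1.0, 1.0])]

queries = [sum_query(i, t) for i, t in zip(reference_images, relative_captions)]
print("loss:", contrastive_loss(queries, target_images))
print("ranking for query 0:", retrieve(queries[0], target_images))
```

In the two-stage scheme described above, the second stage would swap `sum_query` for a learned Combiner while leaving the retrieval step, ranking index images by similarity to the combined feature, unchanged.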