In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, thereby extending the user's ability to express what they are looking for. We make the following contributions: (i) we propose a scalable pipeline that automatically constructs datasets for training CIR models by exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components of TransAgg; (iv) when evaluated on publicly available benchmarks in the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/
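To make the composed-retrieval setting concrete, the following is a minimal sketch of how a fused query might be scored against candidate images. It is not TransAgg: the abstract does not specify the fusion mechanism, so the fixed weights `w_image` and `w_text` merely stand in for whatever adaptive weights a trained model would predict, and the three-dimensional embeddings, the helper names (`normalize`, `fuse`, `rank`), and the candidate labels are all hypothetical values chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse(image_emb, text_emb, w_image, w_text):
    """Weighted sum of the two modality embeddings, then unit-normalized.

    The weights are fixed here; in an adaptive model they would be
    predicted per query. This is an illustrative assumption only.
    """
    mixed = [w_image * a + w_text * b for a, b in zip(image_emb, text_emb)]
    return normalize(mixed)


def rank(query_emb, candidates):
    """Return candidate names sorted by cosine similarity to the query."""
    scores = {
        name: sum(q * c for q, c in zip(query_emb, normalize(emb)))
        for name, emb in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical): a reference image and a modification text.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]

# Toy candidate pool (hypothetical).
candidates = {
    "unchanged": [1.0, 0.0, 0.0],  # matches only the reference image
    "edited": [1.0, 1.0, 0.0],     # matches both image and text
    "unrelated": [0.0, 0.0, 1.0],  # matches neither
}

query = fuse(reference_image, modification_text, w_image=0.5, w_text=0.5)
ranking = rank(query, candidates)
print(ranking)
```

In this toy setup the candidate that reflects both the reference image and the modification text ranks first, which is the behaviour a composed query is meant to produce.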