In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, thereby extending the user's expressive ability. We make the following contributions: (i) we propose a scalable pipeline that automatically constructs datasets for training CIR models by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components of TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/
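To make the CIR setting concrete, the following is a minimal, generic sketch of composing a query from a reference-image embedding and a text embedding, retrieving from a gallery by cosine similarity, and scoring with Recall@K. It is not the TransAgg architecture: the softmax-weighted sum stands in for any adaptive fusion, the weights would normally be predicted by a learned module rather than passed in, and the names `compose_query`, `retrieve`, and `recall_at_k` are hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a short list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose_query(image_emb, text_emb, logits):
    # Assumption: `logits` are two fusion scores (image, text). In a trained
    # model they would come from a learned module; here they are given.
    w_img, w_txt = softmax(logits)
    fused = [w_img * i + w_txt * t for i, t in zip(image_emb, text_emb)]
    return l2_normalize(fused)


def retrieve(query, gallery, k=1):
    # Rank gallery items by cosine similarity to the composed query.
    # Ties are broken by the lower gallery index.
    scored = []
    for idx, item in enumerate(gallery):
        sim = sum(q * g for q, g in zip(query, l2_normalize(item)))
        scored.append((sim, idx))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(ranked_lists, targets, k):
    # Fraction of queries whose target index appears in the top-k results.
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)
```

In the zero-shot setting described above, the fusion module would be trained only on the automatically constructed data, and `recall_at_k` would then be computed on the target benchmark without further fine-tuning.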