Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers text and image queries together to search for relevant images in a database. Most CoIR approaches require manually annotated datasets of image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and limits scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets from video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine pairs of videos with similar captions from a large database and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, improving the state of the art in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
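The triplet-generation idea described above can be illustrated with a toy sketch. This is not the released pipeline: the abstract does not state how caption similarity is measured, so the sketch assumes two captions are "similar" when they differ in exactly one word, and it replaces the large language model with a fixed text template. All function names (`mine_caption_pairs`, `template_modification`, `build_triplets`) and the sample captions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mine_caption_pairs(captions: Dict[str, str]) -> List[Tuple[str, str, str, str]]:
    """Find ordered video pairs whose captions differ in exactly one word.

    Assumption: "similar caption" means a one-word difference. Returns
    (query_id, target_id, old_word, new_word) tuples, sorted for determinism.
    """
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key is the caption with position i masked out, so captions that
            # agree everywhere except position i fall into the same bucket.
            key = (i, tuple(words[:i]), tuple(words[i + 1:]))
            buckets[key].append((video_id, word))

    pairs = []
    for entries in buckets.values():
        for query_id, old_word in entries:
            for target_id, new_word in entries:
                if query_id != target_id and old_word != new_word:
                    pairs.append((query_id, target_id, old_word, new_word))
    return sorted(pairs)


def template_modification(old_word: str, new_word: str) -> str:
    """Stand-in for the language model: a fixed template, not generated text."""
    return f"replace {old_word} with {new_word}"


def build_triplets(captions: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Build (query video, modification text, target video) triplets."""
    return [
        (query_id, template_modification(old_word, new_word), target_id)
        for query_id, target_id, old_word, new_word in mine_caption_pairs(captions)
    ]


# Hypothetical captions keyed by hypothetical video ids.
captions = {
    "v1": "a dog running on the beach",
    "v2": "a cat running on the beach",
    "v3": "a man cooking pasta",
}
triplets = build_triplets(captions)
for triplet in triplets:
    print(triplet)
```

Bucketing on the masked caption avoids comparing every caption against every other one, which matters when the caption pool is large. In a real setting the template would be replaced by a call to a language model that receives both captions and writes the modification text.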