This paper proposes CompoDiff, a novel diffusion-based model for Composed Image Retrieval (CIR) built on latent diffusion, and presents SynthTriplets18M, a newly created dataset of 18 million triplets, each consisting of a reference image, a condition, and the corresponding target image, for training the model. CompoDiff and SynthTriplets18M address shortcomings of previous CIR approaches, such as poor generalizability caused by small dataset scale and a limited range of condition types. CompoDiff not only achieves a new zero-shot state of the art on four CIR benchmarks (FashionIQ, CIRR, CIRCO, and GeneCIS) but also enables more versatile and controllable CIR: it accepts various conditions, such as negative text and image masks, and lets users control the relative importance of multiple queries as well as the trade-off between inference speed and performance, none of which is available in existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff