Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query consisting of a reference image and a modification text. The text specifies how to alter the reference image to form a ``mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this ``mental image'' is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods: it uses a Large Multimodal Model (LMM) to generate a textual description of a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the ``mental image'' for more accurate matching. Specifically, we prompt an LMM to generate a ``mental image'' for a given multimodal query and use this ``mental image'' to search for the target image. Since the generated ``mental image'' exhibits a synthetic-to-real domain gap relative to real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a ``paracosm'', within which it matches the multimodal query against the database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
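The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the calls that would invoke an LMM (to render the ``mental image'' and the synthetic counterparts of database images) and a vision encoder are stubbed with random embeddings, so only the synthetic-to-synthetic matching logic is shown. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image_like):
    """Stand-in for a VLM image encoder: returns an L2-normalized vector.
    In the actual method this would encode a generated image."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# 1) ``Mental image'' generated by the LMM from (reference image, modification text),
#    then embedded.
mental_image_emb = embed("mental image")

# 2) Synthetic counterpart generated for each real database image, then embedded,
#    so matching happens entirely inside the ``paracosm'' (synthetic domain).
database_synthetic_embs = np.stack([embed(f"synthetic {i}") for i in range(100)])

# 3) Retrieve by cosine similarity (dot product of normalized vectors):
#    the top-ranked database image is returned as the target.
scores = database_synthetic_embs @ mental_image_emb
ranked = np.argsort(-scores)  # database indices, best match first
print(ranked[0])
```

Because both sides of the comparison are LMM-generated, the synthetic-to-real domain gap never enters the similarity computation; that is the design choice the abstract motivates.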