Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query consisting of a reference image and a modification text. The text specifies how to alter the reference image to form a ``mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this ``mental image'' is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods: it uses a Large Multimodal Model (LMM) to generate a textual description of a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the ``mental image'' for more accurate matching. Specifically, we prompt an LMM to generate a ``mental image'' for a given multimodal query and use this ``mental image'' to search for the target image. Since the generated ``mental image'' exhibits a synthetic-to-real domain gap relative to real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a ``paracosm'', within which it matches the multimodal query against the database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
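The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the calls that would invoke an LMM (to render the ``mental image'' and the synthetic counterparts of database images) and a vision encoder are stubbed with random embeddings, so only the synthetic-to-synthetic matching logic is shown. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image_like):
    """Stand-in for a VLM image encoder: returns an L2-normalized vector.
    In the actual method this would encode a generated image."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# 1) ``Mental image'' generated by the LMM from (reference image, modification text),
#    then embedded.
mental_image_emb = embed("mental image")

# 2) Synthetic counterpart generated for each real database image, then embedded,
#    so matching happens entirely inside the ``paracosm'' (synthetic domain).
database_synthetic_embs = np.stack([embed(f"synthetic {i}") for i in range(100)])

# 3) Retrieve by cosine similarity (dot product of normalized vectors):
#    the top-ranked database image is returned as the target.
scores = database_synthetic_embs @ mental_image_emb
ranked = np.argsort(-scores)  # database indices, best match first
print(ranked[0])
```

Because both sides of the comparison are LMM-generated, the synthetic-to-real domain gap never enters the similarity computation; that is the design choice the abstract motivates.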