Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of a reference image and a caption describing desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, TAT is not only efficient in terms of training data scale and training time, but also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. Integrating Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
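The Slerp operation described above can be sketched in a few lines: it finds an intermediate point on the unit hypersphere between the image and text embeddings. This is a minimal NumPy illustration of the general Slerp formula only; the interpolation weight `alpha`, the vector names, and the fallback for near-parallel inputs are assumptions, not the paper's tuned configuration.

```python
import numpy as np

def slerp(img_emb, txt_emb, alpha=0.5):
    """Spherical linear interpolation between two embeddings.

    Both inputs are normalized to the unit hypersphere; alpha=0.5
    (an illustrative default, not the paper's value) returns the
    midpoint along the great-circle arc between them.
    """
    p = img_emb / np.linalg.norm(img_emb)
    q = txt_emb / np.linalg.norm(txt_emb)
    # Angle between the two unit vectors; clip guards against
    # floating-point values slightly outside [-1, 1].
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        # Nearly parallel vectors: interpolation is trivially p.
        return p
    return (np.sin((1.0 - alpha) * theta) * p
            + np.sin(alpha * theta) * q) / np.sin(theta)
```

For orthogonal unit vectors, `slerp(p, q, 0.5)` yields the normalized mean direction, and the result always stays on the unit sphere — which is why Slerp, unlike plain linear interpolation, keeps the composed query embedding comparable to the normalized gallery embeddings.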