Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval

Composed image retrieval (CIR) aims to retrieve a target image based on a multimodal query, i.e., a reference image paired with corresponding modification text. Recent CIR studies use vision-language pre-trained (VLP) models as the feature extraction backbone and perform nonlinear feature-level fusion of the multimodal query to retrieve the target image. Despite their promising performance, we argue that nonlinear feature-level multimodal fusion may cause the fused feature to deviate from the original embedding space, potentially hurting retrieval performance. To address this issue, we propose shifting the multimodal fusion from the feature level to the raw-data level to fully exploit the VLP model's multimodal encoding and cross-modal alignment abilities. In particular, we introduce a Dual Query Unification-based Composed Image Retrieval framework (DQU-CIR), whose backbone involves only a VLP model's image encoder and text encoder. Specifically, DQU-CIR first employs two training-free query unification components, text-oriented query unification and vision-oriented query unification, to derive a unified textual query and a unified visual query, respectively, from the raw data of the multimodal query. The unified textual query is derived by concatenating the modification text with a textual description extracted from the reference image, while the unified visual query is created by writing the key modification words onto the reference image. Finally, to address diverse search intentions, DQU-CIR linearly combines the features of the two unified queries, as encoded by the VLP model, to retrieve the target image. Extensive experiments on four real-world datasets validate the effectiveness of the proposed method.
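
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the concatenation order and separator, the fixed scalar fusion weight, and the L2 normalization are all assumptions made for illustration. The feature vectors stand in for the outputs of a VLP model's text and image encoders, and the vision-oriented step (writing key modification words onto the reference image) is omitted because it would require an image library.

```python
import math
from typing import List, Sequence


def unify_text_query(caption: str, modification: str, sep: str = ", ") -> str:
    """Text-oriented unification: concatenate the reference image's caption
    with the modification text. Order and separator are assumptions."""
    return f"{caption.strip()}{sep}{modification.strip()}"


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_queries(text_feat: Sequence[float],
                 image_feat: Sequence[float],
                 weight: float = 0.5) -> List[float]:
    """Linearly combine the features of the two unified queries.
    `weight` is a hypothetical fixed scalar chosen for illustration."""
    if len(text_feat) != len(image_feat):
        raise ValueError("features must have the same dimension")
    t, i = l2_normalize(text_feat), l2_normalize(image_feat)
    return l2_normalize([weight * a + (1.0 - weight) * b for a, b in zip(t, i)])


def rank_candidates(query_feat: Sequence[float],
                    candidate_feats: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted by cosine similarity, best first."""
    q = l2_normalize(query_feat)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(c)))
              for c in candidate_feats]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Toy usage with 2-D stand-in features instead of real encoder outputs.
unified_text = unify_text_query("a red dress", "make it blue")
fused = fuse_queries([1.0, 0.0], [0.0, 1.0])
print(unified_text)
print(rank_candidates(fused, [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]))  # [1, 0, 2]
```

Because the fused query is a linear combination of two vectors produced by the same encoders, it stays within the span of the original embedding space, which is the property the abstract contrasts with nonlinear feature-level fusion.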
