Composed image retrieval searches for a target image based on a multi-modal user query consisting of a reference image and modification text describing the desired changes. Existing approaches to this challenging task learn a mapping from the (reference image, modification text) pair to an image embedding that is then matched against a large image corpus. One area that has not yet been explored is the reverse direction, which asks: what reference image, when modified as described by the text, would produce the given target image? In this work, we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures. To encode the bi-directional query, we prepend a learnable token to the modification text that designates the direction of the query and then fine-tune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our novel approach improves on a BLIP-based baseline that itself already achieves state-of-the-art performance.
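To make the reversed query concrete, the following minimal sketch shows one way a training triplet (reference image, modification text, target image) could be expanded into a forward and a reversed query. It is an illustration under stated assumptions, not the implementation used in this work: the token strings `[FWD]` and `[BWD]`, the `Query` class, and the `make_bidirectional_queries` helper are all hypothetical names, and plain strings stand in for what would be a learnable token embedding inside the text encoder.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical direction markers. In a real model these would be new entries
# in the text encoder's vocabulary whose embeddings are learned in training.
FORWARD_TOKEN = "[FWD]"
BACKWARD_TOKEN = "[BWD]"


@dataclass(frozen=True)
class Query:
    query_image: str   # image given as part of the query
    text: str          # modification text with the direction token prepended
    answer_image: str  # image the query should retrieve


def make_bidirectional_queries(
    triplets: List[Tuple[str, str, str]],
) -> List[Query]:
    """Expand (reference, text, target) triplets into forward and reversed queries."""
    queries: List[Query] = []
    for reference, text, target in triplets:
        # Forward: reference image + text should retrieve the target image.
        queries.append(Query(reference, f"{FORWARD_TOKEN} {text}", target))
        # Reversed: target image + same text should retrieve the reference image.
        queries.append(Query(target, f"{BACKWARD_TOKEN} {text}", reference))
    return queries


if __name__ == "__main__":
    example = [("dress_red.jpg", "make it blue and sleeveless", "dress_blue.jpg")]
    for q in make_bidirectional_queries(example):
        print(q)
```

In this sketch the modification text is shared between both directions and only the prepended token differs, which mirrors the idea that the direction is signalled through the text input alone while the rest of the architecture is left unchanged.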