Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to express their search intents more freely. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can make those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than the prior best on eight benchmarks spanning various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.