This work introduces composed image retrieval to remote sensing. It allows a large image archive to be queried with an image example altered by a textual description, which makes the query more expressive than a unimodal query, whether visual or textual. The textual part can modify various attributes, such as shape, color, or context. We introduce a novel method that fuses image-to-image and text-to-image similarity. We demonstrate that a vision-language model possesses sufficient descriptive power, so that no further learning step or training data is necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir
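The idea of fusing image-to-image and text-to-image similarity can be illustrated with a minimal sketch. It assumes that the query image, the modifier text, and the archive images have already been embedded into a shared space by some vision-language model. The function names, the min-max normalization, and the convex combination with weight `lam` are illustrative assumptions, not necessarily the fusion rule used in the paper or in the linked repository.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def min_max(scores, eps=1e-12):
    """Rescale a score vector to [0, 1] so both modalities are comparable.

    Assumption: min-max scaling is used here only for illustration.
    """
    span = max(scores.max() - scores.min(), eps)
    return (scores - scores.min()) / span


def fused_scores(query_image_emb, query_text_emb, archive_embs, lam=0.5):
    """Combine image-to-image and text-to-image similarity for every archive image.

    lam = 0 uses only the query image, lam = 1 uses only the query text.
    Assumption: a convex combination; the source does not specify the rule.
    """
    q_img = l2_normalize(np.asarray(query_image_emb, dtype=float))
    q_txt = l2_normalize(np.asarray(query_text_emb, dtype=float))
    db = l2_normalize(np.asarray(archive_embs, dtype=float))

    s_img = db @ q_img  # image-to-image cosine similarity
    s_txt = db @ q_txt  # text-to-image cosine similarity
    return (1.0 - lam) * min_max(s_img) + lam * min_max(s_txt)


def retrieve(query_image_emb, query_text_emb, archive_embs, lam=0.5, k=5):
    """Return the indices of the k archive images with the highest fused score."""
    scores = fused_scores(query_image_emb, query_text_emb, archive_embs, lam)
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real vision-language features.
    archive = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query_image = np.array([1.0, 0.0])
    query_text = np.array([0.0, 1.0])
    print(retrieve(query_image, query_text, archive, lam=0.5, k=3))
```

In this toy setup, the archive item that resembles both the query image and the modifier text ranks first when the two similarities are weighted equally, while setting `lam` to 0 or 1 falls back to a purely visual or purely textual query. No training step is involved; only precomputed embeddings are compared.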