This work introduces composed image retrieval to remote sensing. It allows querying a large image archive with image examples complemented by a textual description, enriching the descriptive power over unimodal queries, either visual or textual. Various attributes can be modified by the textual part, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data is necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir
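The fusion of image-to-image and text-to-image similarity can be illustrated with a minimal sketch. The snippet below assumes precomputed, L2-normalized embeddings from a CLIP-style vision-language model; the fusion weight `alpha`, the helper names, and the toy dimensions are illustrative assumptions, not the exact formulation or values used in the paper.

```python
import numpy as np

def composed_retrieval_scores(query_img_emb, query_txt_emb, gallery_embs, alpha=0.5):
    """Rank gallery images for a composed (image + text) query.

    Assumes all embeddings are L2-normalized vectors from the same
    CLIP-style vision-language model, so dot products equal cosine
    similarities. `alpha` balances the two modalities (illustrative value).
    """
    img_sim = gallery_embs @ query_img_emb   # image-to-image similarity
    txt_sim = gallery_embs @ query_txt_emb   # text-to-image similarity
    return alpha * img_sim + (1.0 - alpha) * txt_sim  # fused score per gallery image

# Toy usage with random stand-in embeddings (512-d, 1000 gallery images).
rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

gallery = l2norm(rng.normal(size=(1000, 512)))
q_img = l2norm(rng.normal(size=512))     # embedding of the query image example
q_txt = l2norm(rng.normal(size=512))     # embedding of the modifying text
scores = composed_retrieval_scores(q_img, q_txt, gallery)
top10 = np.argsort(-scores)[:10]         # indices of the ten best-matching gallery images
```

Because both similarities live in the same embedding space of a single vision-language model, this kind of fusion needs no additional training; only the balance between the visual and textual cues is chosen.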