Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. For remote sensing imagery, these tasks can help non-expert users extract relevant Earth observation information for a variety of applications. Still, despite previous efforts, the development and application of vision and language models in the remote sensing domain have been hindered by the relatively small size of the datasets and models used in previous studies. In this work, we propose RS-CapRet, a vision and language method for remote sensing tasks, in particular image captioning and text-image retrieval. Specifically, we propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the image encoder and the language decoder, we train simple linear layers on examples drawn from a combination of remote sensing image captioning datasets, keeping all other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving state-of-the-art or competitive performance relative to existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images and retrieve them based on different types of queries, and that it can process interleaved sequences of images and text in a dialogue manner.
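The bridging step described above can be illustrated with a minimal sketch. It is not the RS-CapRet implementation: the abstract does not state how many linear layers are used, which spaces they map between, or how retrieval scores are computed, so the three layers, the shared retrieval space, the cosine similarity, the toy dimensions, and all names below are assumptions made for illustration. Random vectors stand in for the outputs of the frozen image encoder and the frozen language model.

```python
import math
import random

rng = random.Random(0)

# Toy dimensions chosen for readability (assumption, not the paper's sizes).
IMG_DIM, LLM_DIM, RET_DIM = 8, 6, 4


def init_linear(in_dim, out_dim):
    """Create one linear layer as (weight rows, bias); only these would be trained."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, layer):
    """Compute W x + b for a single input vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-ins for frozen model outputs (hypothetical values, never updated).
image_features = [[rng.gauss(0, 1) for _ in range(IMG_DIM)] for _ in range(3)]
caption_token_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)]
                            for _ in range(5)]
query_hidden_state = [rng.gauss(0, 1) for _ in range(LLM_DIM)]

# The only trainable parameters in this sketch: the linear bridges.
img_to_llm = init_linear(IMG_DIM, LLM_DIM)   # image feature -> decoder input space
img_to_ret = init_linear(IMG_DIM, RET_DIM)   # image feature -> retrieval space
txt_to_ret = init_linear(LLM_DIM, RET_DIM)   # decoder state -> retrieval space

# Captioning: the projected image feature is prepended to the caption token
# embeddings, so the frozen decoder can condition on the image.
visual_prefix = linear(image_features[0], img_to_llm)
decoder_input = [visual_prefix] + caption_token_embeddings

# Retrieval: project the text query and every image into the shared space,
# then rank images by similarity to the query.
query_vec = linear(query_hidden_state, txt_to_ret)
image_vecs = [linear(f, img_to_ret) for f in image_features]
scores = [cosine(query_vec, v) for v in image_vecs]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In a real training setup, only the weights and biases of the linear layers would receive gradient updates, while the encoder and decoder parameters stay fixed, which keeps the number of trained parameters small.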