Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, lacking the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and its associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt that guides the VLM toward contextually grounded responses. We validate the effectiveness of our approach on three representative vision-language tasks, namely image captioning, image classification, and visual question answering, where RS-RAG significantly outperforms state-of-the-art baselines.
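The retrieve-then-augment pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding vectors, knowledge texts, and function names (`retrieve`, `build_prompt`) are all hypothetical, embeddings are stand-in NumPy arrays rather than outputs of a real multimodal encoder, and re-ranking is reduced to a single cosine-similarity sort.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # standard cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, kb_vecs: list, kb_texts: list, top_k: int = 2) -> list:
    # score every knowledge-base entry against the query embedding,
    # then keep the top_k highest-scoring texts (a stand-in for re-ranking)
    scores = [cosine_sim(query_vec, v) for v in kb_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [kb_texts[i] for i in ranked[:top_k]]

def build_prompt(question: str, retrieved: list) -> str:
    # fold the retrieved knowledge into a knowledge-augmented prompt
    context = "\n".join(f"- {t}" for t in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In a full system, `kb_vecs` would come from encoding both satellite images and landmark descriptions into the unified vector space, and the resulting prompt would be passed to the VLM for response generation.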