Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advances, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding, focus on generic scene descriptions, and lack the ability to incorporate external knowledge. This limitation hinders semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building on this dataset, we propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and its associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge for an image and/or text query, then incorporates the retrieved content into a knowledge-augmented prompt that guides the VLM toward contextually grounded responses. We validate the effectiveness of our approach on three representative vision-language tasks: image captioning, image classification, and visual question answering, where RS-RAG significantly outperforms state-of-the-art baselines.
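The retrieval-and-prompt-augmentation step described above can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the toy 3-dimensional vectors stand in for embeddings from the unified image-text encoder, the knowledge-base entries are invented examples, and all function names (`cosine`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy knowledge vector database: (embedding, textual knowledge) pairs.
# In the framework, embeddings would come from the unified image-text encoder.
KB = [
    ([0.9, 0.1, 0.0], "Eiffel Tower: wrought-iron lattice tower in Paris."),
    ([0.1, 0.8, 0.2], "Sydney Opera House: performing-arts centre on Sydney Harbour."),
    ([0.0, 0.2, 0.9], "Giza pyramid complex: Old Kingdom pyramids near Cairo."),
]

def retrieve(query_vec, k=2):
    # Rank all entries by similarity to the query and keep the top k.
    ranked = sorted(KB, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, query_vec):
    # Fold the retrieved knowledge into a knowledge-augmented prompt
    # that conditions the VLM's response on external context.
    context = "\n".join(retrieve(query_vec))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("Which landmark is shown?", [0.85, 0.15, 0.05])
```

A query embedding close to the first entry retrieves the Eiffel Tower description first, so the generated prompt grounds the VLM's answer in that external knowledge rather than in the image alone. A production system would replace the linear scan with an approximate nearest-neighbor index and add the re-ranking stage the framework describes.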