RSGPT: A Remote Sensing Vision Language Model and Benchmark

The emergence of large language models (LLMs), with GPT-4 as a prominent example, has significantly accelerated progress toward artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In remote sensing (RS), there is growing interest in developing large vision language models (VLMs) tailored to data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, and the field lacks comprehensive, large-scale, aligned image-text datasets suitable for training large VLMs, which makes it difficult to train such models effectively for RS applications. In computer vision, recent research has demonstrated that fine-tuning large VLMs on small-scale, high-quality datasets can yield impressive performance in visual and language understanding, comparable to that of state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this idea, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets, which employ either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich, high-quality information. The dataset offers a detailed description of each image, covering the scene (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, and absolute position). To facilitate the evaluation of VLMs in RS, we also provide a benchmark evaluation dataset called RSIEval. It consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.
