Remote sensing image (RSI) intelligence understanding is undergoing a profound paradigm shift driven by multi-modal large language models (MLLMs): from the paradigm of learning a domain model (LaDM) to the paradigm of learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets that drove advances in RSI intelligence understanding over the last decade are no longer suitable for the emerging tasks. We argue that a new dataset must be designed to support tasks with the following features: 1) Generalization: training the model to learn knowledge shared among tasks and to adapt to different tasks; 2) Understanding complex scenes: training the model to understand the fine-grained attributes of the objects of interest and to describe the scene with natural language; 3) Reasoning: training the model to perform high-level visual reasoning. In this paper, we design a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding, produced with GPT-4V and existing datasets, which we call RS-GPT4V. To achieve generalization, we use (Question, Answer) pairs deduced from GPT-4V via instruction following to unify tasks such as captioning and localization. To achieve complex-scene understanding, we propose a hierarchical instruction description with a local strategy, in which the fine-grained attributes of the objects and their spatial relationships are described, and a global strategy, in which all the local information is integrated into a detailed instruction description. To achieve reasoning, we design multi-turn QA pairs to endow the model with reasoning ability. Empirical results show that MLLMs fine-tuned on RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
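The unified instruction-following format described above, (Question, Answer) pairs covering captioning, localization, and multi-turn reasoning, can be sketched as a JSON-style record. The field names and example contents below are illustrative assumptions, not the actual RS-GPT4V schema.

```python
import json

# Hypothetical sketch of a multi-turn instruction-following record,
# in the style described in the abstract (field names and contents
# are assumptions, not the actual RS-GPT4V schema).
record = {
    "image": "example_rsi.png",
    "conversations": [
        # Turn 1: a captioning-style instruction unified as a QA pair.
        {"question": "Describe the scene in detail.",
         "answer": "An airport with two parallel runways; several "
                   "white airplanes are parked near the terminal."},
        # Turn 2: a localization-style instruction in the same QA format.
        {"question": "Where are the airplanes located?",
         "answer": "Near the terminal building in the upper-left region."},
        # Turn 3: a reasoning-style follow-up that builds on earlier turns.
        {"question": "Is this airport likely in active use, and why?",
         "answer": "Yes; aircraft parked at the terminal and clear "
                   "runway markings suggest active operation."},
    ],
}

# Serialize one record, as instruction-tuning corpora are typically
# stored one JSON object per sample.
serialized = json.dumps(record, indent=2)
print(len(record["conversations"]))  # number of QA turns in this sample
```

Representing every task as a QA turn lets a single fine-tuning pipeline consume captioning, localization, and reasoning samples without task-specific heads.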