Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain. Owing to the significant differences between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill this gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly integrates various multi-sensor RS interpretation tasks for universal RS image comprehension. Three key techniques are developed in EarthGPT: a visual-enhanced perception mechanism, a cross-modal mutual comprehension approach, and a unified instruction tuning method for multi-sensor, multi-task settings in the RS domain. More importantly, a large-scale multi-sensor, multi-modal RS instruction-following dataset named MMRS-1M is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multi-sensor images such as optical, synthetic aperture radar (SAR), and infrared. MMRS-1M addresses the lack of RS expert knowledge in MLLMs and stimulates the development of MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior performance in various RS visual interpretation tasks compared with other specialist models and MLLMs, proving the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks.