Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and vision-language tasks within the natural image domain. Owing to the significant differences between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill this gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly integrates various multi-sensor RS interpretation tasks for universal RS image comprehension. EarthGPT develops three key techniques: a visual-enhanced perception mechanism, a cross-modal mutual comprehension approach, and a unified instruction tuning method for multi-sensor, multi-task settings in the RS domain. More importantly, a large-scale multi-sensor, multi-modal RS instruction-following dataset named MMRS-1M is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multi-sensor images such as optical, synthetic aperture radar (SAR), and infrared. MMRS-1M addresses the lack of RS expert knowledge in MLLMs and stimulates the development of MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior performance in various RS visual interpretation tasks compared with other specialist models and MLLMs, which proves the effectiveness of the proposed EarthGPT and offers a versatile paradigm for open-set reasoning tasks.