Recent advances in prompt learning have allowed users to interact with artificial intelligence (AI) tools in multi-turn dialogue, enabling an interactive understanding of images. However, it is difficult and inefficient to convey information about complicated remote sensing (RS) scenarios using plain language instructions alone, which severely hinders deep comprehension of the latent content in imagery. Moreover, existing prompting strategies designed for natural scenes are difficult to apply to RS data interpretation due to significant domain differences. To address these challenges, EarthMarker, the first visual prompting-based multi-modal large language model (MLLM) in the RS domain, is proposed. EarthMarker is capable of interpreting RS imagery at the image, region, and point levels by leveraging visual prompts (i.e., boxes and points). Specifically, a shared visual encoding method is developed to establish spatial pattern interpretation relationships between the multi-scale representations of input images and various visual prompts. Subsequently, the mixed visual-spatial representations are associated with language instructions to construct joint prompts, enabling the interpretation of intricate content in RS imagery. Furthermore, to bridge the domain gap between natural and RS data and effectively transfer domain-level knowledge from natural scenes to the RS domain, a cross-domain learning strategy is developed to facilitate RS imagery understanding. In addition, to tackle the lack of RS visual prompting data, a dataset named RSVP, featuring multi-modal and multi-granularity visual prompt instruction-following, is constructed. Our code and dataset are available at https://github.com/wivizhang/EarthMarker.