RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning

Remote sensing (RS) images acquired from different modalities and platforms vary widely in detail because of differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS relies largely on relatively homogeneous data sources and remains limited to conventional visual perception tasks such as classification and captioning. As a result, these methods cannot serve as a unified, standalone framework that handles RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model that handles multi-modal, multi-platform data and performs perception and reasoning tasks based on textual user instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset, RS-VL3M, which comprises over 3 million image-text pairs spanning optical, synthetic aperture radar (SAR), and infrared (IR) modalities collected from both satellite and UAV platforms, and covers both perception and challenging reasoning tasks; 2) learns modality-adaptive representations by incorporating separate embedding layers that construct isolated features for heterogeneous modalities and reduce cross-modal interference; and 3) unifies task modeling by introducing task-specific tokens and a token-based decoding mechanism over high-dimensional hidden states, designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent is effective in both visual understanding and sophisticated analytical tasks and generalizes strongly across platforms and sensing modalities.
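
The separate-embedding idea in point 2 can be illustrated with a minimal sketch. This is not the RingMo-Agent implementation: the class name `ModalityRoutedEmbedding`, the patch sizes, the token width, and the use of a plain linear projection are all assumptions made for illustration. The sketch only shows the routing pattern, where each modality owns its own embedding weights and every modality maps into one shared token width.

```python
import random


class ModalityRoutedEmbedding:
    """Hypothetical sketch: one linear patch embedding per modality.

    Each modality has its own weight matrix, so optical, SAR, and IR
    inputs do not share embedding parameters, but all of them map into
    the same token width for a shared downstream model.
    """

    def __init__(self, patch_dims, token_dim, seed=0):
        rng = random.Random(seed)
        self.patch_dims = dict(patch_dims)
        self.token_dim = token_dim
        # One (token_dim x in_dim) weight matrix per modality (assumed layout).
        self.weights = {
            modality: [
                [rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                for _ in range(token_dim)
            ]
            for modality, in_dim in self.patch_dims.items()
        }

    def embed(self, modality, patches):
        """Project flattened patches with the weights owned by `modality`."""
        if modality not in self.weights:
            raise KeyError(f"no embedding layer for modality {modality!r}")
        in_dim = self.patch_dims[modality]
        for patch in patches:
            if len(patch) != in_dim:
                raise ValueError(f"expected patches of length {in_dim}")
        weight = self.weights[modality]
        # Each output token is the matrix-vector product weight @ patch.
        return [
            [sum(w * x for w, x in zip(row, patch)) for row in weight]
            for patch in patches
        ]


# Illustrative sizes only: 2x2 patches, 3 channels for optical, 1 for SAR and IR.
embedding = ModalityRoutedEmbedding(
    {"optical": 12, "sar": 4, "ir": 4}, token_dim=8
)
sar_tokens = embedding.embed("sar", [[0.1] * 4, [0.2] * 4])
optical_tokens = embedding.embed("optical", [[0.5] * 12])
```

Because both calls return tokens of the same width, a shared model could consume them interchangeably, while the per-modality weights stay isolated from one another.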
