Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across nine task categories within a single autoregressive framework. Three dedicated mechanisms each address one of three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint gap and the imaging-physics gap in turn. To support joint training, we construct MMRS-OneVision, a dataset of ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across the nine task categories, substantially larger than existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming RS-MLLMs with 4B to 72B parameters. It achieves 87.52% [email protected] on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by more than 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
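
To make the idea of emitting spatial outputs as autoregressive tokens concrete, the following is a minimal sketch of one common approach: quantizing a bounding box into integer bins and rendering it as plain text that a language model could generate token by token. This is an illustration only and is not the SLIS format, which the abstract does not specify. The `<box>` tag syntax, the 1000-bin resolution, and the function names `box_to_text` and `text_to_box` are all hypothetical.

```python
import re

BINS = 1000  # assumed quantization resolution (hypothetical, not from the paper)

# Hypothetical text format: <box>(x1,y1),(x2,y2)</box> with integer bin indices.
_BOX_RE = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")


def box_to_text(box, width, height, bins=BINS):
    """Quantize a pixel-space box (x1, y1, x2, y2) and render it as text."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Map a pixel coordinate to an integer bin in [0, bins - 1].
        return min(bins - 1, max(0, int(value / size * bins)))

    return (
        f"<box>({quantize(x1, width)},{quantize(y1, height)}),"
        f"({quantize(x2, width)},{quantize(y2, height)})</box>"
    )


def text_to_box(text, width, height, bins=BINS):
    """Parse the first serialized box in `text` back to pixel coordinates.

    Returns None if no box is found. Each coordinate is decoded to the
    center of its bin, so the round-trip error is at most one bin width.
    """
    match = _BOX_RE.search(text)
    if match is None:
        return None
    qx1, qy1, qx2, qy2 = (int(group) for group in match.groups())
    return (
        (qx1 + 0.5) / bins * width,
        (qy1 + 0.5) / bins * height,
        (qx2 + 0.5) / bins * width,
        (qy2 + 0.5) / bins * height,
    )


if __name__ == "__main__":
    serialized = box_to_text((100, 200, 300, 400), width=800, height=800)
    print(serialized)
    print(text_to_box(serialized, width=800, height=800))
```

Under a scheme like this, a grounding answer is ordinary text in the model's output sequence, so the same decoding loop can serve both language and localization tasks. Other spatial outputs (rotated boxes, polygons, points) would need their own serialization, which this sketch does not attempt.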
