Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across nine task categories within a single autoregressive framework. Three dedicated mechanisms each address one bottleneck. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint gap and then the imaging-physics gap. To support joint training, the MMRS-OneVision dataset is constructed with approximately 34M QA pairs spanning all six sensor modalities and cross-sensor fusion across the nine task categories, substantially exceeding existing RS multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming RS-MLLMs with 4B to 72B parameters. It achieves 87.52% on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.