Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse, unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked-autoencoding paradigms, makes it challenging to assess performance trade-offs consistently. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining by using identical self-supervised learning objectives and training datasets, and we evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs among model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.