Foundation Models (FMs) have achieved state-of-the-art performance across domains by leveraging large-scale pretraining. In Earth Observation (EO), the availability of petabyte-scale satellite archives has recently enabled the development of Geospatial Foundation Models (GFMs). Yet fundamental questions remain about how dataset scale, model architecture, and model size interact to determine downstream performance. In this work, we systematically explore this design space by pretraining and fine-tuning models at three dataset scales: PhilEO Globe (0.5 TB), FastTOM (2 TB, introduced here), and MajorTOM (23 TB). We evaluate three architectural families, Geo-Aware U-Net (CNN), ViT-UPerNet (Transformer), and Mamba (State-Space Model), at model sizes ranging from 44M to 300M parameters. All models are benchmarked on the PhilEO Bench, which covers road density regression, building density regression, and land cover segmentation, and are compared against existing GFMs such as TerraMind and Prithvi-EO-2.0. Our results show that CNN-based models remain highly competitive in low-shot settings: a 200M-parameter Geo-Aware U-Net outperforms larger architectures on the regression tasks. When scaling to multi-terabyte datasets, however, ViT-UPerNet achieves the best performance, particularly for semantic segmentation on MajorTOM (23 TB). Finally, we provide the first extensive evaluation of Mamba models in EO, highlighting their potential efficiency advantages, although further large-scale pretraining is needed for them to fully match CNNs and ViTs. All code, pretrained models, and the FastTOM dataset are released publicly, enabling reproducibility and further exploration of scaling laws for GFMs.