Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, limiting the usefulness of the evaluations for assessing the real-world applicability of GFMs. Additionally, current evaluation protocols lack diversity: they fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, calling into question the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and broadly applicable benchmark for GFMs. We evaluate the most popular openly available GFMs on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g., UNet and vanilla ViT) and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at https://github.com/VMarsocci/pangaea-bench.