GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe-Moreno, Alexandre Lacoste

Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but their evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). These groups let users identify which models excel in each capability and which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. The protocol ensures consistent benchmarking and also facilitates research into model adaptation strategies, a key open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming that choices made during architecture design and pretraining favor specific tasks. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that the optimal model choice depends on task requirements, data modalities, and constraints, and that a single GeoFM performing well across all tasks remains an open goal for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
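
The idea of ranking models within capability groups can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model names, dataset names, group memberships, and scores are placeholders rather than results from the paper, and the plain mean used for aggregation is a stand-in that may differ from the benchmark's actual aggregation procedure.

```python
from statistics import mean

# Hypothetical scores in [0, 1]; placeholder names, not results from the paper.
scores = {
    "model_a": {"dataset_1": 0.80, "dataset_2": 0.60, "dataset_3": 0.70},
    "model_b": {"dataset_1": 0.70, "dataset_2": 0.90, "dataset_3": 0.50},
}

# Hypothetical capability groups; a dataset may belong to several groups.
capability_groups = {
    "high_resolution": ["dataset_1", "dataset_3"],
    "multispectral": ["dataset_2", "dataset_3"],
}


def rank_by_capability(scores, groups):
    """Return, per group, the model names sorted by mean score (best first).

    Assumption: a plain mean over the group's datasets is used as the
    aggregate; this is an illustrative choice, not the paper's method.
    """
    rankings = {}
    for group, datasets in groups.items():
        group_means = {
            model: mean(per_dataset[d] for d in datasets)
            for model, per_dataset in scores.items()
        }
        rankings[group] = sorted(group_means, key=group_means.get, reverse=True)
    return rankings


rankings = rank_by_capability(scores, capability_groups)
for group, ordered_models in rankings.items():
    print(group, ordered_models)
```

With these placeholder numbers the two groups produce opposite orderings, which mirrors the abstract's point that a model can lead in one capability and trail in another.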
