Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere or the atmosphere and to at most 16 tasks. These limitations manifest as narrow-source heterogeneity (single or few data sources), constrained scientific granularity, and limited cross-sphere extensibility. We therefore introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres (atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere) as well as cross-sphere interactions. Built on a scalable, modular-topology data-inference framework that combines native multi-observation sources with expert-in-the-loop curation, OmniEarth-Bench comprises 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task) encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs show that even the most advanced models struggle with our benchmark: none reaches 35% accuracy, exposing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code are released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).