TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models forecast well but do not reason interactively in language, whereas large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning, and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains, with 24,500 verified execution steps. Our results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.
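To make the pairing of process-level tool-use metrics with tolerance-aware numeric scoring concrete, the following is a minimal sketch and not TerraBench's actual scoring code. The function names (`numeric_match`, `tool_sequence_score`), the tolerance values, and the tool names in the usage note are all hypothetical. The sketch assumes a numeric answer counts as correct when it lies within a relative or absolute tolerance of the reference, and that tool use is scored by how much of a reference tool sequence the agent's trace reproduces in order.

```python
from typing import Sequence


def numeric_match(predicted: float, reference: float,
                  rel_tol: float = 0.05, abs_tol: float = 1e-9) -> bool:
    """Return True if `predicted` is within tolerance of `reference`.

    The allowed error is the larger of an absolute floor and a fraction of
    the reference magnitude (both thresholds are illustrative assumptions).
    """
    allowed = max(abs_tol, rel_tol * abs(reference))
    return abs(predicted - reference) <= allowed


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tool-name sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_sequence_score(trace: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of the reference tool sequence reproduced, in order, by the trace.

    Extra tool calls in the trace are not penalized in this sketch.
    """
    if not reference:
        return 1.0
    return lcs_length(trace, reference) / len(reference)
```

As a usage example under the same assumptions, a trace `["load_grid", "plot_map", "compute_mean"]` scored against a reference `["load_grid", "clip_region", "compute_mean"]` reproduces two of the three reference steps, and `numeric_match(287.9, 288.15, rel_tol=0.01)` accepts an answer that deviates by 0.25 from the reference. Reporting the two scores separately distinguishes an agent that reaches the right number through the wrong workflow from one that follows the right workflow but parameterizes a tool imprecisely.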
