Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no existing benchmark evaluates agents that operate through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, including municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API that serves three environmental indicators across Spain and Portugal through sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) with three random seeds at temperature 1.0, reporting capability and per-case cost as orthogonal axes. We find that: (1) Claude Sonnet 4 leads at 60.8% ± 0.8%, followed by DeepSeek V3.2 at 56.3% ± 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain unsolved by every model (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 percentage points lower than those reported on such benchmarks. We further demonstrate extensibility by integrating BigEarthNet V2 land cover for Portugal alongside the Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.
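The cost-accuracy figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the benchmark harness: the function `pareto_frontier` and the entry `placeholder-model` are hypothetical names introduced here, the placeholder's numbers are made up purely to show a dominated point, and Claude's per-case cost is only inferred from the stated 11x ratio rather than reported directly.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models not dominated by any other model.

    Each value is (cost_per_case_usd, accuracy_pct). A model is dominated
    if another model is at least as cheap and at least as accurate, and
    strictly better on one of the two axes.
    """
    frontier = []
    for name, (cost, acc) in points.items():
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, (c, a) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier from cheapest to most expensive.
    return sorted(frontier, key=lambda n: points[n][0])


# Figures taken from the abstract.
claude_acc, deepseek_acc = 60.8, 56.3
deepseek_cost = 0.011  # USD per case

# Derived quantities.
relative_capability = deepseek_acc / claude_acc  # about 0.93
implied_claude_cost = deepseek_cost * 11         # assumption: inferred from "11x"

points = {
    "Claude Sonnet 4": (implied_claude_cost, claude_acc),
    "DeepSeek V3.2": (deepseek_cost, deepseek_acc),
    # Made-up illustrative values, NOT a result from the benchmark.
    "placeholder-model": (0.050, 50.0),
}

frontier = pareto_frontier(points)
print(f"relative capability: {relative_capability:.1%}")
print(f"implied Claude cost: ${implied_claude_cost:.3f}/case")
print("frontier:", frontier)
```

Under these assumptions both named models sit on the frontier, since neither is simultaneously cheaper and more accurate than the other, while the placeholder is dominated by DeepSeek V3.2.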