Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents that features the intricate tasks analysts routinely execute on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completion. By alleviating the overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks across 1.1 million satellite images. Moving beyond traditional single-task image-caption paradigms, we evaluate state-of-the-art agents and prompting techniques on long-horizon prompts.