Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multi-step tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside spectral index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open-source and closed-source models.
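The index analyses named above (NDVI, NBR, NDBI) are all normalized differences of two spectral bands. As a minimal sketch, assuming per-pixel reflectance values held in plain nested lists, they could be computed as follows. The function names, the `Grid` alias, and the zero-denominator fallback are illustrative assumptions and are not taken from the OpenEarthAgent toolset.

```python
from typing import List

# Hypothetical alias: a 2D raster band as rows of reflectance values.
Grid = List[List[float]]


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b); fall back to 0.0 when a + b is zero (assumed convention)."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def index_map(band_a: Grid, band_b: Grid) -> Grid:
    """Apply the normalized difference pixel by pixel to two equally sized bands."""
    return [
        [normalized_difference(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(band_a, band_b)
    ]


def ndvi(nir: Grid, red: Grid) -> Grid:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return index_map(nir, red)


def nbr(nir: Grid, swir2: Grid) -> Grid:
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return index_map(nir, swir2)


def ndbi(swir1: Grid, nir: Grid) -> Grid:
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    return index_map(swir1, nir)


# Toy 1x2 scene with made-up reflectance values, for illustration only.
nir_band = [[0.5, 0.3]]
red_band = [[0.1, 0.3]]
swir1_band = [[0.2, 0.4]]
swir2_band = [[0.1, 0.3]]

vegetation = ndvi(nir_band, red_band)
burn = nbr(nir_band, swir2_band)
built_up = ndbi(swir1_band, nir_band)
```

All three indices lie in the range [-1, 1] for non-negative reflectances, which is what makes them convenient quantities for an agent to threshold or compare across acquisition dates.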