Recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for interacting with humans through both language and visual inputs to perform downstream tasks such as visual question answering and scene understanding. However, these models are limited to basic instruction-following or descriptive tasks and struggle with complex real-world remote sensing applications that require specialized tools and knowledge. To address these limitations, we propose RS-Agent, an AI agent that interacts with human users and autonomously invokes specialized models to meet the demands of real-world remote sensing applications. RS-Agent integrates four key components: a Central Controller based on large language models, a dynamic toolkit for tool execution, a Solution Space for task-specific expert guidance, and a Knowledge Space for domain-level reasoning. Together, these components enable it to interpret user queries and orchestrate tools to carry out remote sensing tasks accurately. We introduce two novel mechanisms: Task-Aware Retrieval, which improves tool selection accuracy through expert-guided planning, and DualRAG, a retrieval-augmented generation method that enhances knowledge relevance through weighted, dual-path retrieval. RS-Agent supports flexible integration of new tools and is compatible with both open-source and proprietary LLMs. Extensive experiments across 9 datasets and 18 remote sensing tasks demonstrate that RS-Agent significantly outperforms state-of-the-art MLLMs, achieving over 95% task planning accuracy and delivering superior performance in tasks such as scene classification, object counting, and remote sensing visual question answering. Our work presents RS-Agent as a robust and extensible framework for advancing intelligent automation in remote sensing analysis.
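The abstract describes DualRAG only as a weighted, dual-path retrieval method and does not say what the two paths are or how they are weighted. The toy sketch below illustrates the general idea of fusing two retrieval scores with weights. Everything in it is an assumption made for illustration: the two paths (word-level and character-trigram cosine similarity), the function names, the default weights, and the example passages are hypothetical and are not taken from RS-Agent.

```python
import math
from collections import Counter


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_score(query: str, doc: str) -> float:
    """Hypothetical path 1: similarity over whitespace-separated words."""
    return _cosine(Counter(query.lower().split()), Counter(doc.lower().split()))


def trigram_score(query: str, doc: str) -> float:
    """Hypothetical path 2: similarity over character trigrams."""
    def grams(text: str) -> Counter:
        text = text.lower()
        return Counter(text[i:i + 3] for i in range(len(text) - 2))
    return _cosine(grams(query), grams(doc))


def dual_path_retrieve(query, docs, w_word=0.5, w_trigram=0.5, top_k=2):
    """Rank docs by a weighted sum of the two path scores (weights are assumed)."""
    scored = [
        (w_word * word_score(query, d) + w_trigram * trigram_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up knowledge passages, for demonstration only.
    knowledge = [
        "Object counting uses a detector to count ships in a harbor image.",
        "Scene classification assigns a land-use label to an image.",
        "Cloud removal restores occluded pixels.",
    ]
    for score, passage in dual_path_retrieve("count ships in the harbor", knowledge):
        print(f"{score:.3f}  {passage}")
```

In a real system the two paths would more plausibly be learned embeddings or separate knowledge sources, and the weights would be tuned; the point here is only the weighted fusion of two independent relevance signals before the retrieved passages are handed to the language model.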