Mobile agents show great potential, yet current state-of-the-art (SoTA) agents achieve low success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' over-reliance on the static, internal knowledge of multimodal large language models (MLLMs), which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UIs). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally different types of knowledge. Planning demands high-level, strategy-oriented experience, whereas operation requires low-level, precise instructions closely tied to specific app UIs. Motivated by this insight, we propose Mobile-Agent-RAG, a hierarchical multi-agent framework that integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG, which reduces strategic hallucinations by retrieving human-validated, comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG, which improves execution accuracy by retrieving precise low-level guidance for atomic actions that matches the current app and subtask. To deliver these two knowledge types accurately, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, and establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
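The dual-level retrieval described above can be illustrated with a minimal sketch. The abstract does not specify the retrieval mechanism or the knowledge-base schema, so everything here is an assumption made for illustration: the `Entry` record, the two toy knowledge bases, the `retrieve` helper, and the token-overlap (Jaccard) similarity, which stands in for whatever retriever the framework actually uses. The only ideas taken from the text are that the manager level retrieves a whole-task plan, while the operator level retrieves guidance restricted to the current app and matched to the current subtask.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entry:
    """Hypothetical knowledge-base record (schema is an assumption)."""
    key: str        # text matched against the query
    guidance: str   # text handed back to the agent
    app: str = ""   # only meaningful for operator-level entries


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, entries: Sequence[Entry],
             app: Optional[str] = None) -> Optional[Entry]:
    """Return the best-matching entry, optionally restricted to one app."""
    pool = [e for e in entries if app is None or e.app == app]
    if not pool:
        return None
    return max(pool, key=lambda e: jaccard(query, e.key))


# Manager level: whole-task plans (toy, made-up examples).
MANAGER_KB = [
    Entry("find a restaurant and share it in a note",
          "1) search Maps; 2) copy the address; 3) paste into Notes"),
    Entry("book a flight and email the itinerary",
          "1) search flights; 2) book; 3) email the confirmation"),
]

# Operator level: app-specific UI instructions (toy, made-up examples).
OPERATOR_KB = [
    Entry("search for a place",
          "tap the search bar, type the query, press enter", app="Maps"),
    Entry("create a new note",
          "tap the plus button, then type the text", app="Notes"),
    Entry("search for a note",
          "tap the magnifier icon, type the title", app="Notes"),
]

if __name__ == "__main__":
    plan = retrieve("find a restaurant nearby and save it in a note", MANAGER_KB)
    step = retrieve("search for a restaurant", OPERATOR_KB, app="Maps")
    print(plan.guidance)
    print(step.guidance)
```

Filtering the operator pool by app before ranking is one simple way to keep low-level guidance tied to the UI currently on screen; a real system could equally fold the app into the query or the index.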