Current mobile assistants are limited by their dependence on system APIs, or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel mobile phone agent powered by multimodal large language models (MLLMs) that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. An integrated Reflection Module enables efficient task completion and allows the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants.
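The two-level control loop described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the GA/LA division of labor and the Reflection Module's check; none of the class or function names (`GlobalAgent`, `LocalAgent`, `reflect`) reflect MobA's actual implementation, and a real system would query an MLLM and ground actions in the current screen rather than use the naive stand-ins below.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalAgent:
    """High-level agent: parses the command, keeps memory, plans sub-tasks."""
    memory: list = field(default_factory=list)

    def plan(self, command: str) -> list:
        # A real GA would decompose the command with an MLLM;
        # here we split on commas as a stand-in.
        subtasks = [s.strip() for s in command.split(",")]
        self.memory.append(("plan", subtasks))
        return subtasks

@dataclass
class LocalAgent:
    """Low-level agent: turns a sub-task into a concrete function call."""
    def act(self, subtask: str, memory: list) -> dict:
        # A real LA would ground the sub-task in the current UI state.
        return {"function": "tap", "args": {"target": subtask}}

def reflect(action: dict) -> bool:
    """Stand-in for the Reflection Module: accept any well-formed call."""
    return "function" in action and "args" in action

def run(command: str) -> list:
    ga, la = GlobalAgent(), LocalAgent()
    trace = []
    for subtask in ga.plan(command):
        action = la.act(subtask, ga.memory)
        if reflect(action):  # a real system would retry or replan on failure
            trace.append(action)
            ga.memory.append(("done", subtask))
    return trace

actions = run("open settings, enable dark mode")
```

Running the sketch on a two-part command yields one function call per sub-task, with the GA's memory recording both the plan and each completed step.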