Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multimodal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived visual context, it then autonomously plans and decomposes the complex operation task and navigates mobile apps step by step through operations. Unlike previous solutions that rely on the XML files of apps or on mobile system metadata, Mobile-Agent takes a vision-centric approach, which allows greater adaptability across diverse mobile operating environments and eliminates the need for system-specific customizations. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations. Using Mobile-Eval, we conduct a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieves remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
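For illustration, the perceive, plan, and operate cycle described in the abstract might be sketched as the loop below. This is a minimal sketch under stated assumptions, not the released Mobile-Agent implementation: the types `UIElement` and `Action`, the function `run_agent`, and the callables `perceive`, `plan`, and `tap` are all hypothetical names. In this sketch, `perceive` stands in for the visual perception tools, `plan` stands in for the MLLM, and `tap` stands in for whatever mechanism sends touch events to the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class UIElement:
    """A text or icon element located on a screenshot (hypothetical type)."""
    label: str
    center: Tuple[int, int]  # pixel coordinates (x, y)


@dataclass
class Action:
    """One step chosen by the planner (hypothetical type)."""
    kind: str                     # "tap" or "stop"
    target: Optional[str] = None  # label of the element to tap


def run_agent(
    instruction: str,
    perceive: Callable[[], List[UIElement]],
    plan: Callable[[str, List[UIElement], List[Action]], Action],
    tap: Callable[[int, int], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat perceive -> plan -> act until the planner stops or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        # Assumption: perception works from the screenshot alone,
        # with no app XML or system metadata.
        elements = perceive()
        action = plan(instruction, elements, history)
        history.append(action)
        if action.kind == "stop":
            break
        # Ground the planner's choice to screen coordinates
        # using the elements found by perception.
        match = next((e for e in elements if e.label == action.target), None)
        if match is not None:
            tap(*match.center)
    return history
```

Passing the three stages in as callables keeps the loop independent of any particular perception model, language model, or device interface, which is one way to reflect the vision-centric, system-agnostic design the abstract describes.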