Achieving autonomous operation of mobile computing devices has long been a human pursuit. With the development of Large Language Models (LLMs) and Visual Language Models (VLMs), this aspiration is progressively becoming reality. While contemporary research has explored the automation of simple tasks on mobile devices via VLMs, substantial room for improvement remains in handling complex tasks and reducing high reasoning costs. In this paper, we introduce MobileExperts, which, for the first time, brings tool formulation and multi-agent collaboration to bear on these challenges. More specifically, MobileExperts dynamically assembles teams by aligning agent portraits with human requirements. Each agent then embarks on an independent exploration phase, formulating its own tools to evolve into an expert. Finally, we develop a dual-layer planning mechanism to establish coordinated collaboration among the experts. To validate effectiveness, we design a new benchmark of hierarchical intelligence levels, offering insight into an algorithm's capability to address tasks across a spectrum of complexity. Experimental results demonstrate that MobileExperts performs better across all intelligence levels and achieves a ~22% reduction in reasoning costs, verifying the superiority of our design.