Recent advances in large language models (LLMs) have enabled intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework for operating smartphone applications. The agent interacts with apps through a simplified action space that mimics human interactions such as tapping and swiping. Because this approach requires no access to the system back end, it applies across a wide range of apps. Central to the agent's functionality is its learning method: the agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent consults when executing complex tasks across different applications. To demonstrate the agent's practicality, we tested it on 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results confirm the agent's proficiency in handling a diverse array of high-level tasks.
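To make the simplified action space and the knowledge base more concrete, the following is a minimal illustrative sketch in Python. It is not the framework's actual implementation: the names `Tap`, `Swipe`, `KnowledgeBase`, `record`, `lookup`, and `describe`, the numeric element identifiers, and the per-element note format are all assumptions introduced here for illustration. Only tapping and swiping are modeled, since those are the interactions named above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Tap:
    """Tap a UI element (hypothetical: elements are referred to by a numeric id)."""
    element_id: int


@dataclass(frozen=True)
class Swipe:
    """Swipe on a UI element in a direction (hypothetical encoding)."""
    element_id: int
    direction: str  # assumed values: "up", "down", "left", "right"


# The whole action space is a small closed set of human-like gestures.
Action = Union[Tap, Swipe]


@dataclass
class KnowledgeBase:
    """Notes gathered during exploration or demonstrations (assumed format)."""
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, app: str, element: str, note: str) -> None:
        # Store what was learned about one UI element of one app.
        self.notes.setdefault(f"{app}/{element}", []).append(note)

    def lookup(self, app: str, element: str) -> List[str]:
        # Return everything known about the element, or an empty list.
        return self.notes.get(f"{app}/{element}", [])


def describe(action: Action) -> str:
    """Render an action as text, e.g. for logging or for a prompt."""
    if isinstance(action, Tap):
        return f"tap(element={action.element_id})"
    return f"swipe(element={action.element_id}, direction={action.direction})"
```

In this sketch, an exploration phase would call `record` after each observed interaction, and a later task-execution phase would call `lookup` before choosing the next `Tap` or `Swipe`. Because the actions are plain gestures on on-screen elements, nothing here depends on access to the app's back end.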