This paper introduces DroidBot-GPT, a tool that uses GPT-like large language models (LLMs) to automate interactions with Android mobile applications. Given a natural language description of a desired task, DroidBot-GPT automatically generates and executes actions that navigate the app to complete the task. It works by translating the app's GUI state and the actions available on the smartphone screen into natural language prompts and asking the LLM to choose an action. Because LLMs are typically trained on large amounts of data, including how-to manuals for diverse software applications, they can make reasonable action choices based on the provided information. We evaluate DroidBot-GPT on a self-created dataset of 33 tasks collected from 17 Android applications spanning 10 categories. It successfully completes 39.39% of the tasks, and the average partial completion progress is about 66.76%. Given that our method is fully unsupervised (it requires no modification to either the app or the LLM), we believe there is great potential to improve automation performance with better app development paradigms and/or custom model training.
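The prompt-and-choose step described in the abstract can be illustrated with a minimal Python sketch. This is not DroidBot-GPT's actual implementation or prompt format: the `UIElement` type, the prompt wording, the reply parser, and the `llm` callable are all hypothetical stand-ins chosen for illustration, and a real system would obtain the element list from the device's view hierarchy and send the prompt to an actual model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class UIElement:
    """Hypothetical, simplified view of one interactive widget on screen."""
    kind: str    # e.g. "button", "text field"
    text: str    # visible label or content description
    action: str  # e.g. "tap", "input text into"


def build_prompt(task: str, elements: List[UIElement]) -> str:
    """Describe the task and the available actions in plain English."""
    lines = [f"Task: {task}", "The current screen offers these actions:"]
    for i, el in enumerate(elements):
        lines.append(f'{i}. {el.action} the {el.kind} labeled "{el.text}"')
    lines.append("Reply with the number of the single best next action.")
    return "\n".join(lines)


def parse_choice(reply: str, n_actions: int) -> Optional[int]:
    """Return the first in-range action index found in the reply, if any."""
    for match in re.finditer(r"\d+", reply):
        idx = int(match.group())
        if 0 <= idx < n_actions:
            return idx
    return None  # the reply named no valid action


def choose_action(
    task: str,
    elements: List[UIElement],
    llm: Callable[[str], str],  # assumed interface: prompt in, reply text out
) -> Optional[UIElement]:
    """Ask the model to pick one of the listed actions for the given task."""
    reply = llm(build_prompt(task, elements))
    idx = parse_choice(reply, len(elements))
    return elements[idx] if idx is not None else None
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM API; repeating `choose_action` after each executed action, with a freshly captured element list, would give the multi-step navigation the abstract describes.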