Mobile graphical user interface (GUI) agents are designed to automate everyday tasks on smartphones. Recent advances in large language models (LLMs) have significantly enhanced the capabilities of these agents. However, most LLM-powered mobile GUI agents operate in stepwise query-act loops, which incur high latency because the LLM is queried at every step. We present GraphPilot, a mobile GUI agent that leverages knowledge graphs of the target apps to complete user tasks in close to a single LLM query. GraphPilot operates in two complementary phases to enable efficient and reliable LLM-powered GUI task automation. In the offline phase, it explores target apps, records and analyzes the interaction history, and constructs an app-specific knowledge graph that encodes the functions of pages and elements as well as the transition rules for each app. In the online phase, given an app and a user task, it uses the knowledge graph of that app to guide the LLM's reasoning process. When the reasoning process encounters uncertainty, GraphPilot dynamically requests the HTML representation of the current interface to refine subsequent reasoning. Finally, a validator checks the generated sequence of actions against the transition rules in the knowledge graph and applies iterative corrections to ensure the sequence is valid. The structured content of the knowledge graph allows the LLM to plan the complete sequence of actions required to complete the user task. On the DroidTask benchmark, GraphPilot improves task completion rate over Mind2Web and AutoDroid, while substantially reducing latency and the number of LLM queries.
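The validation step can be illustrated with a minimal sketch. The abstract does not specify the knowledge graph schema or the validator's algorithm, so everything here is an assumption made for illustration: transition rules are modeled as a mapping from a (page, action) pair to the resulting page, and the names `AppGraph`, `add_rule`, and `first_invalid_step` are hypothetical rather than taken from GraphPilot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class AppGraph:
    """Hypothetical app knowledge graph: (page, action) -> resulting page."""
    transitions: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_rule(self, page: str, action: str, next_page: str) -> None:
        # Record one transition rule observed during offline exploration.
        self.transitions[(page, action)] = next_page


def first_invalid_step(graph: AppGraph, start_page: str,
                       actions: List[str]) -> Optional[int]:
    """Replay a planned action sequence against the transition rules.

    Returns the index of the first action that has no matching rule on the
    page reached so far, or None if every action is permitted.
    """
    page = start_page
    for index, action in enumerate(actions):
        next_page = graph.transitions.get((page, action))
        if next_page is None:
            return index  # the plan leaves the known graph at this step
        page = next_page
    return None


# Illustrative rules for an imaginary notes app (assumed, not from the paper).
graph = AppGraph()
graph.add_rule("home", "tap:new_note", "editor")
graph.add_rule("editor", "input:title", "editor")
graph.add_rule("editor", "tap:save", "home")

valid_plan = ["tap:new_note", "input:title", "tap:save"]
broken_plan = ["tap:new_note", "tap:delete", "tap:save"]

print(first_invalid_step(graph, "home", valid_plan))
print(first_invalid_step(graph, "home", broken_plan))
```

In a setup like this, the index of the failing step could be passed back to the LLM so that only the remainder of the plan is regenerated, which is one plausible reading of the "iterative corrections" the abstract mentions.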