Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Models (LLMs) or LLM-based AI agents have shown that tasks on Android and Web platforms can be automated. However, these tasks are primarily limited to simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models can manipulate the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the project files needed for evaluation. Moreover, we propose an Actor-Critic Embodied Agent framework, which incorporates a GUI parser driven by an LLM agent and an enhanced reasoning mechanism designed to handle lengthy procedural tasks. Our experimental results show that our GUI parser and reasoning mechanism outperform existing methods. Nevertheless, substantial room for improvement remains: the best model attains only a 46% success rate on our benchmark. We conclude with a thorough analysis of the limitations of current methods, setting the stage for future work in this domain.