Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing work leveraging Large Language Models (LLMs) or LLM-based AI agents has shown the ability to automate tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results show that our GUI parser and reasoning mechanism outperform existing methods. Nevertheless, substantial room for improvement remains: the best model attains only a 46% success rate on our benchmark. We conclude with a thorough analysis of the limitations of current methods, setting the stage for future breakthroughs in this domain.