ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Models (LLMs) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results show that our GUI parser and reasoning mechanism outperform existing methods. Nevertheless, substantial room for improvement remains, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain.

