ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

Source: arXiv. Project page: https://showlab.github.io/assistgui/

Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing work leveraging Large Language Models (LLMs) or LLM-based AI agents has shown the ability to automate tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models can manipulate the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results show that our GUI parser and reasoning mechanism outperform existing methods. Nevertheless, substantial room for improvement remains: the best model attains only a 46% success rate on our benchmark. We conclude with a thorough analysis of the limitations of current methods, setting the stage for future breakthroughs in this domain.
