Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation. The code is available at https://github.com/showlab/WorldGUI.