GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

Autonomous agents capable of operating complex graphical user interfaces (GUIs) have the potential to transform desktop automation. While recent advances in large language models (LLMs) have significantly improved UI understanding, navigating full-window, multi-application desktop environments remains a major challenge. Data availability is limited by costly manual annotation, closed-source datasets, and surface-level synthetic pipelines. We introduce GUIrilla, an automated, scalable framework that systematically explores applications via native accessibility APIs to address the critical data collection challenge in GUI automation. Our framework focuses on macOS, an ecosystem with limited representation in current UI datasets, though many of its components are designed for broader cross-platform applicability. GUIrilla organizes discovered interface elements and crawler actions into hierarchical GUI graphs and employs specialized interaction handlers to achieve comprehensive application coverage. Using the application graphs from the GUIrilla crawler, we construct and release GUIrilla-Task, a large-scale dataset of 27,171 functionally grounded tasks across 1,108 macOS applications, each annotated with full-desktop and window-level screenshots, accessibility metadata, and semantic action traces. Empirical results show that tuning LLM-based agents on GUIrilla-Task significantly improves performance on downstream UI tasks, outperforming synthetic baselines on the ScreenSpot Pro benchmark while using 97% less data. We also release macapptree, an open-source library for reproducible collection of structured accessibility metadata, along with the full GUIrilla-Task dataset, the manually verified GUIrilla-Gold benchmark, and the framework code to support open research in desktop autonomy.
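
The idea of organizing discovered interface elements and crawler actions into a hierarchical graph can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `GUINode`, `crawl`, `action_trace`, and `MOCK_APP` are hypothetical names, the "application" is a hard-coded dictionary rather than a live accessibility tree, and the code neither uses macapptree nor reproduces the GUIrilla crawler.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class GUINode:
    """One discovered UI state plus the elements visible in it (hypothetical)."""
    state: str
    elements: list[str]
    parent: GUINode | None = None
    action: str | None = None  # action that led here from the parent
    children: list[GUINode] = field(default_factory=list)


def crawl(mock_app: dict[str, dict[str, str | None]], root_state: str) -> GUINode:
    """Breadth-first exploration of a mock app.

    mock_app maps a state name to {element_name: resulting_state_or_None}.
    A real crawler would read elements from an accessibility API instead.
    """
    root = GUINode(root_state, list(mock_app[root_state]))
    seen = {root_state}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for element, target in mock_app[node.state].items():
            if target is None or target in seen:
                continue  # element leads to no new state
            seen.add(target)
            child = GUINode(target, list(mock_app[target]), node, f"click:{element}")
            node.children.append(child)
            queue.append(child)
    return root


def action_trace(node: GUINode) -> list[str]:
    """Actions from the root that reach this node, in execution order."""
    trace: list[str] = []
    while node.parent is not None:
        trace.append(node.action)
        node = node.parent
    return trace[::-1]


# Hard-coded stand-in for an application's UI (illustrative only).
MOCK_APP = {
    "main_window": {"File menu": "file_menu", "Settings button": "settings", "Title label": None},
    "file_menu": {"Open": "open_dialog", "Close": "main_window"},
    "settings": {"Dark mode toggle": None},
    "open_dialog": {"Cancel": "main_window"},
}

root = crawl(MOCK_APP, "main_window")
open_dialog = root.children[0].children[0]
print(action_trace(open_dialog))
```

In this sketch each tree node is a UI state and each parent-to-child link records the action that produced it, so the path from the root to any node doubles as an action trace of the kind a task could be grounded in.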

