Autonomous agents capable of operating complex graphical user interfaces (GUIs) have the potential to transform desktop automation. While recent advances in large language models (LLMs) have significantly improved UI understanding, navigating full-window, multi-application desktop environments remains a major challenge. Data availability is limited by costly manual annotation, closed-source datasets, and surface-level synthetic pipelines. We introduce GUIrilla, an automated, scalable framework that systematically explores applications via native accessibility APIs to address the critical data collection challenge in GUI automation. Our framework focuses on macOS, an ecosystem with limited representation in current UI datasets, though many of its components are designed for broader cross-platform applicability. GUIrilla organizes discovered interface elements and crawler actions into hierarchical GUI graphs and employs specialized interaction handlers to achieve comprehensive application coverage. Using the application graphs produced by the GUIrilla crawler, we construct and release GUIrilla-Task, a large-scale dataset of 27,171 functionally grounded tasks across 1,108 macOS applications, each annotated with full-desktop and window-level screenshots, accessibility metadata, and semantic action traces. Empirical results show that tuning LLM-based agents on GUIrilla-Task significantly improves performance on downstream UI tasks, outperforming synthetic baselines on the ScreenSpot Pro benchmark while using 97% less data. We also release macapptree, an open-source library for reproducible collection of structured accessibility metadata, along with the full GUIrilla-Task dataset, the manually verified GUIrilla-Gold benchmark, and the framework code to support open research in desktop autonomy.
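To make the notion of a hierarchical GUI graph concrete, the following is a minimal sketch in Python, not the GUIrilla implementation. It assumes that nodes stand for discovered interface elements, that edges stand for the crawler actions that revealed them, and that a task's action trace corresponds to a root-to-leaf path. The names `GUINode`, `GUIEdge`, `add_child`, and `action_traces` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GUINode:
    """One discovered interface element (hypothetical schema, not GUIrilla's)."""
    node_id: int
    role: str    # accessibility role, e.g. "AXButton"
    label: str   # human-readable name of the element
    children: list[GUIEdge] = field(default_factory=list)


@dataclass
class GUIEdge:
    """A crawler action that leads from a parent node to a child node."""
    action: str  # e.g. "click"
    target: GUINode


def add_child(parent: GUINode, action: str, child: GUINode) -> GUINode:
    """Attach `child` under `parent`, reached by performing `action`."""
    parent.children.append(GUIEdge(action, child))
    return child


def action_traces(root: GUINode) -> list[list[tuple[str, str]]]:
    """Return every root-to-leaf path as a list of (action, label) steps."""
    traces = []
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        if not node.children and path:
            traces.append(path)
        # Push children in reverse so they are visited in insertion order.
        for edge in reversed(node.children):
            step = (edge.action, edge.target.label)
            stack.append((edge.target, path + [step]))
    return traces


# Toy application: a window with a File menu and a Settings button.
root = GUINode(0, "AXWindow", "Main window")
file_menu = add_child(root, "click", GUINode(1, "AXMenuItem", "File"))
add_child(file_menu, "click", GUINode(2, "AXMenuItem", "New Document"))
add_child(root, "click", GUINode(3, "AXButton", "Settings"))

for trace in action_traces(root):
    print(" -> ".join(f"{action} '{label}'" for action, label in trace))
```

Under these assumptions, each printed path is a candidate action sequence that could be paired with screenshots and accessibility metadata to form a task. A real crawler would additionally need to handle cycles and repeated states, which this tree-shaped sketch ignores.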