Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong

From arXiv, 49 pages, 13 figures

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples produced through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving performance on OSWorld from 5% to 27%. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmarks, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
