Pursuing human-like interaction for Graphical User Interface (GUI) agents requires both understanding the GUI context and following user instructions. However, existing works typically couple these two aspects and focus more on instruction-following abilities, while ignoring the importance of understanding the GUI context. In this paper, we introduce an instruction-free GUI navigation dataset, termed the Insight-UI Dataset, to enhance model comprehension of GUI environments. The Insight-UI Dataset is automatically generated from the Common Crawl corpus, simulating various platforms -- including iOS, Android, Windows, and Linux -- across multiple resolutions on 312K domains. Although GUI interactions vary by context, diverse interfaces share common internal patterns, such as clicking an item to view its details. This implies that GUI operations can be learned independently first and then jointly optimized with instruction tuning. Accordingly, we develop the GUI agent model Falcon-UI, which is initially pretrained on the Insight-UI Dataset and subsequently fine-tuned on Android and Web GUI datasets, including AITW, AITZ, Android Control, and Mind2Web. With 7 billion parameters, Falcon-UI achieves accuracy comparable to the 72-billion-parameter Qwen2VL on AITZ, validating the alignment between GUI context comprehension and agent performance. Our code and dataset will be open-sourced.