Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly rely on text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, combining web-based synthetic data with a slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding to date, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents equipped with UGround outperform state-of-the-art agents, even though existing agents use additional text-based input while ours relies only on visual perception. These results provide strong support for the feasibility and promise of GUI agents that navigate the digital world as humans do.
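To make the grounding interface concrete, below is a minimal Python sketch of the input/output contract described above: a visual grounding model maps a screenshot and a natural-language referring expression to pixel coordinates, which a vision-only agent then acts on. The names here (VisualGrounder, GroundingResult, click_element) and the stub behavior are illustrative assumptions, not the actual UGround implementation or API.

```python
# Hypothetical sketch of GUI visual grounding: (screenshot, referring expression) -> pixel coordinates.
# The model class is a stand-in; a real grounder would run a LLaVA-style MLLM over the inputs.
from dataclasses import dataclass
from typing import Tuple

from PIL import Image


@dataclass
class GroundingResult:
    x: int  # pixel column of the referred GUI element's center
    y: int  # pixel row of the referred GUI element's center


class VisualGrounder:
    """Placeholder for a UGround-style grounding model (names are assumptions)."""

    def ground(self, screenshot: Image.Image, referring_expression: str) -> GroundingResult:
        # Stub: return the image center so the sketch runs end to end.
        w, h = screenshot.size
        return GroundingResult(x=w // 2, y=h // 2)


def click_element(screenshot: Image.Image, instruction: str, grounder: VisualGrounder) -> Tuple[int, int]:
    """Vision-only action step: ground the instruction, then act at pixel level."""
    target = grounder.ground(screenshot, instruction)
    # An agent would dispatch a click at (target.x, target.y) via its platform driver.
    return target.x, target.y


if __name__ == "__main__":
    screen = Image.new("RGB", (1920, 1080))
    print(click_element(screen, "the blue 'Sign in' button in the top-right corner", VisualGrounder()))
```

The point of the sketch is the interface, not the model: the agent never sees HTML or an accessibility tree, only the rendered screenshot and the expression to be grounded.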