Digital agents that automate tasks across platforms by directly manipulating graphical user interfaces (GUIs) are increasingly important. For these agents, grounding language instructions to target elements remains a significant challenge, largely due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse, high-quality instruction samples for grounding. To handle dynamic contexts during task execution, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.
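As a concrete illustration of the interleaved action-history input described above, the sketch below shows one plausible way to assemble past (screenshot, action) pairs together with the current screen into a single multimodal grounding prompt. This is a minimal sketch under assumed conventions, not the released Aria-UI code: the chat-style message schema and all names (`HistoryStep`, `build_grounding_prompt`) are hypothetical.

```python
# Minimal sketch (hypothetical, not the authors' released code) of assembling
# a text-image interleaved action history for context-aware GUI grounding.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class HistoryStep:
    """One prior step: the screenshot observed and the action then taken."""
    screenshot_path: str   # path to the screen image at that step
    action_text: str       # e.g. 'click the "Sign in" button'


def build_grounding_prompt(
    instruction: str,
    history: List[HistoryStep],
    current_screenshot: str,
) -> List[Dict[str, Any]]:
    """Interleave past (screenshot, action) pairs with the current screen so
    the model can resolve the instruction against the dynamic task context."""
    content: List[Dict[str, Any]] = []
    for i, step in enumerate(history):
        content.append({"type": "image", "path": step.screenshot_path})
        content.append({"type": "text", "text": f"Step {i + 1}: {step.action_text}"})
    content.append({"type": "image", "path": current_screenshot})
    content.append({
        "type": "text",
        "text": f"Instruction: {instruction}\nReturn the target element's coordinates.",
    })
    return [{"role": "user", "content": content}]


# Example usage with placeholder paths:
prompt = build_grounding_prompt(
    instruction="open the settings menu",
    history=[HistoryStep("step1.png", 'click the "Profile" icon')],
    current_screenshot="current.png",
)
```

A text-only variant of the history (dropping the per-step screenshots) follows the same structure with the image entries omitted; the abstract notes that Aria-UI supports both forms.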