Autonomous GUI agents face two fundamental challenges: early stopping, where an agent prematurely declares success without verifiable evidence, and repetitive loops, where an agent cycles through the same failing actions without recovering. We present VLAA-GUI, a modular GUI agent framework built around three integrated components that tell the system when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step: an agent-level verifier cross-examines completion claims against decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: it switches interaction mode after repeated failures, forces a strategy change after persistent screen-state recurrence, and binds reflection signals to strategy shifts. Third, an on-demand Search Agent looks up unfamiliar workflows online by directly querying a capable LLM with search ability and returns the results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand. We evaluate VLAA-GUI with five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks covering Linux and Windows tasks, and achieve top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
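To make the Loop Breaker's tiers concrete, the following is a minimal illustrative sketch, not the VLAA-GUI implementation. The class name, method name, thresholds, window size, intervention labels, and the priority order among tiers are all assumptions introduced here; the abstract specifies only the three kinds of trigger (repeated failures, screen-state recurrence, and reflection signals).

```python
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class LoopBreaker:
    """Hypothetical multi-tier loop detector (all names and defaults are assumptions)."""

    max_action_failures: int = 3  # assumed threshold for repeated failures
    max_state_repeats: int = 3    # assumed threshold for screen-state recurrence
    window: int = 10              # assumed number of recent screens to remember
    _failures: Counter = field(default_factory=Counter)
    _recent_states: Deque[str] = field(default_factory=deque)

    def observe(
        self,
        action: str,
        succeeded: bool,
        screen_hash: str,
        reflection_flags_stall: bool = False,
    ) -> Optional[str]:
        """Record one step and return an intervention label, or None."""
        # Track consecutive failures per action; a success clears the count.
        if succeeded:
            self._failures.pop(action, None)
        else:
            self._failures[action] += 1

        # Keep a bounded history of screen fingerprints.
        self._recent_states.append(screen_hash)
        if len(self._recent_states) > self.window:
            self._recent_states.popleft()

        # Reflection signal is bound directly to a strategy shift.
        if reflection_flags_stall:
            return "change_strategy"
        # Persistent recurrence of the same screen state forces a strategy change.
        if self._recent_states.count(screen_hash) >= self.max_state_repeats:
            return "change_strategy"
        # Repeated failures of the same action switch the interaction mode.
        if self._failures[action] >= self.max_action_failures:
            return "switch_interaction_mode"
        return None
```

In this sketch the agent loop would call `observe` once per step, passing a fingerprint of the current screenshot (for example, a perceptual hash, which is also an assumption), and act on any returned label before choosing its next action.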