A History-Aware Visually Grounded Critic for Computer Use Agents

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Source: arXiv. Code: https://github.com/G-JWLee/HiViG

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.
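
The test-time loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Action`, `Element`, `grounded_critique`, and `run_step` are hypothetical, the bounding-box check is a simple stand-in for the multimodal critic's screenshot-based verification, and the one-line history entry is a stand-in for the learned macro-action summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    target: str  # UI element the policy intends to click (hypothetical field)
    x: int       # raw click coordinates proposed by the policy
    y: int


@dataclass
class Element:
    name: str
    box: Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def grounded_critique(action: Action, elements: List[Element]) -> Optional[str]:
    """Return None if the click lands inside the intended element,
    otherwise a verbal critique. Assumption: element boxes are given;
    the paper's critic works from the screenshot itself."""
    for el in elements:
        if el.name == action.target:
            left, top, right, bottom = el.box
            if left <= action.x <= right and top <= action.y <= bottom:
                return None
            return f"click ({action.x}, {action.y}) misses '{el.name}' at {el.box}"
    return f"'{action.target}' is not on the current screen"


def run_step(
    propose: Callable[[List[str], Optional[str]], Action],
    elements: List[Element],
    history: List[str],
    max_retries: int = 2,
) -> Optional[Action]:
    """Ask the policy for an action, critique it before execution, and
    re-prompt with the critique until it passes or retries run out."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(history, feedback)
        feedback = grounded_critique(action, elements)
        if feedback is None:
            # Stand-in for the macro-action history: record what was achieved.
            history.append(f"clicked {action.target}")
            return action
    return None  # every proposal was rejected; nothing is executed


# Usage: the first proposal misses the button, the second one hits it.
screen = [Element("Submit", (100, 200, 180, 230))]
attempts = iter([Action("Submit", 300, 50), Action("Submit", 140, 215)])
history: List[str] = []
accepted = run_step(lambda hist, fb: next(attempts), screen, history)
```

In this sketch the flawed first click is intercepted before execution, and only the accepted action is written to the history that the policy sees on later steps.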
