Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting how much history can be incorporated under fixed context and compute budgets. As a result, and unlike in other domains, adding history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops redundant patches while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but is rather a consequence of inefficient token representations.
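To make the idea of cross-screenshot patch redundancy concrete, the following is a minimal sketch and not the ReVision selector itself. It assumes a fixed mean absolute pixel difference stands in for the learned comparison of patch representations, and that screenshots are small grayscale grids. The names `keep_mask` and `kept_fraction`, the `threshold` parameter, and the toy frames are hypothetical. The output is a mask over the patch grid rather than a compacted list, which is one simple way to keep each retained patch tied to its original position.

```python
from typing import List

Frame = List[List[int]]  # grayscale screenshot as rows of pixel intensities


def keep_mask(prev: Frame, curr: Frame, patch: int,
              threshold: float = 0.0) -> List[List[bool]]:
    """Return a patch-grid mask: True where a patch of `curr` differs from `prev`.

    Assumption: mean absolute pixel difference replaces the learned selector.
    """
    height, width = len(curr), len(curr[0])
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be multiples of the patch size")
    mask = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            # Mean absolute difference over the pixels of this patch.
            diff = sum(
                abs(curr[y][x] - prev[y][x])
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ) / (patch * patch)
            row.append(diff > threshold)
        mask.append(row)
    return mask


def kept_fraction(mask: List[List[bool]]) -> float:
    """Fraction of patches that would still be encoded as visual tokens."""
    flat = [keep for row in mask for keep in row]
    return sum(flat) / len(flat)


# Toy 4x4 "screenshots" with 2x2 patches: only the bottom-right patch changes.
prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[3][3] = 255

mask = keep_mask(prev_frame, curr_frame, patch=2)
fraction = kept_fraction(mask)
print(mask)      # [[False, False], [False, True]]
print(fraction)  # 0.25
```

In this toy case three of the four patches are unchanged between the two frames, so only one patch of the newer frame would need to be kept. The figures reported above come from the learned selector on real trajectories, not from a pixel heuristic like this one.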