Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
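The core idea of Temporal-Adaptive Resolution (TAR), progressively downscaling older screenshots to mirror the agent's "fading memory", can be sketched as follows. The abstract gives no formula, so the exponential decay schedule, the function name `tar_resolutions`, and all parameters below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of TAR's decay-based resizing: the decay schedule and all
# parameter values here are assumptions for illustration, not the paper's
# actual method.

def tar_resolutions(history_len, base_hw=(1080, 1920), decay=0.7, min_side=56):
    """Return a (height, width) target per historical frame, oldest first.

    The current frame (age 0) keeps full resolution; each step back in
    time shrinks both sides by `decay`, floored at `min_side` so even
    distant frames retain a coarse layout signal.
    """
    h, w = base_hw
    res = []
    for idx in range(history_len):
        age = history_len - 1 - idx  # oldest frame has the largest age
        scale = decay ** age
        res.append((max(min_side, round(h * scale)),
                    max(min_side, round(w * scale))))
    return res
```

Because vision-encoder FLOPs grow roughly with pixel count, a geometric decay like this shrinks the history's total token budget toward a constant factor of a single frame's cost, which is consistent with the reported 3.4x FLOPs reduction in spirit, though the actual savings depend on the real schedule.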