Vision-Language Models (VLMs) have shown remarkable performance on User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4,700 at 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with a UI. In this work, we pioneer the task of efficient UI grounding. Guided by a practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects the patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges. (1) Eliminating redundant tokens in visual encoding: we construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions, selecting visual tokens that are both distinctive and instruction-relevant. (2) Preserving positional continuity during visual token selection: we find that general visual token pruning methods suffer severe accuracy degradation on UI grounding tasks because they break positional information. We therefore introduce a novel PosPad strategy, which compresses each contiguous run of dropped visual tokens into a single special marker placed at the run's last index, preserving positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B improves over GUI-Actor-7B by 3.7%. Even when retaining only 30% of visual tokens, FocusUI-7B's accuracy drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
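The two mechanisms above can be illustrated with a minimal sketch: fuse the two per-patch scores to pick a keep mask, then collapse each run of dropped tokens into one marker at the run's last index. The fusion weight `alpha`, the marker id `PAD_ID`, and all function names here are illustrative assumptions, not details from the paper.

```python
import numpy as np

PAD_ID = -1  # hypothetical special-marker id, not from the paper


def select_tokens(instr_score, graph_score, keep_ratio=0.3, alpha=0.5):
    """Fuse instruction-conditioned and UI-graph scores (linear fusion is an
    assumed scheme) and keep the top `keep_ratio` fraction of visual tokens."""
    fused = alpha * np.asarray(instr_score) + (1 - alpha) * np.asarray(graph_score)
    k = max(1, int(round(keep_ratio * len(fused))))
    keep = np.zeros(len(fused), dtype=bool)
    keep[np.argsort(fused)[-k:]] = True  # indices of the k highest fused scores
    return keep


def pospad(token_ids, keep_mask):
    """PosPad-style compression: kept tokens retain their original position;
    each maximal contiguous run of dropped tokens becomes a single PAD_ID
    assigned the run's last index, so positional continuity is preserved."""
    ids, positions = [], []
    run_end = None  # last index of the current dropped run, if any
    for i, (tok, keep) in enumerate(zip(token_ids, keep_mask)):
        if keep:
            if run_end is not None:  # close the pending dropped run
                ids.append(PAD_ID)
                positions.append(run_end)
                run_end = None
            ids.append(tok)
            positions.append(i)
        else:
            run_end = i  # extend the dropped run
    if run_end is not None:  # flush a trailing dropped run
        ids.append(PAD_ID)
        positions.append(run_end)
    return ids, positions
```

For example, dropping tokens 1-2 and 4 of a six-token sequence yields `pospad([10, 11, 12, 13, 14, 15], [True, False, False, True, False, True]) == ([10, -1, 13, -1, 15], [0, 2, 3, 4, 5])`: two runs collapse to markers at indices 2 and 4, so the kept tokens' positional encodings are unchanged.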