GUI grounding, which localizes interface elements in screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping the screenshot and re-running inference at higher resolution, but they crop every instance with a fixed crop size, regardless of whether the model is actually uncertain about that instance. We propose \textbf{UI-Zoomer}, a training-free adaptive zoom-in framework that treats both the trigger and the scale of zoom-in as a problem of quantifying prediction uncertainty. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence, so that zoom-in is triggered only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, and derives a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 show consistent improvements over strong baselines across multiple model architectures, with gains of up to +13.4\%, +10.3\%, and +4.2\%, respectively, and no additional training.
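To make the variance decomposition concrete, the following is a minimal sketch of how a per-instance crop radius could be derived from a set of sampled boxes. It is not the paper's implementation. It assumes each sampled box is modeled as a uniform distribution over its extent (so a side of length $w$ contributes variance $w^2/12$), that the samples are equally weighted, and that the radius is a multiple `k` of the total standard deviation. The function name `crop_radius` and the scale factor `k` are hypothetical. Under these assumptions the law of total variance gives, per axis, total variance = variance of the box centers (inter-sample) + mean within-box variance (intra-sample).

```python
import math
from statistics import fmean, pvariance

# A box is (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


def crop_radius(boxes: list[Box], k: float = 3.0) -> float:
    """Hypothetical crop radius from sampled boxes via the law of total variance.

    Assumption: each box is a uniform distribution over its extent and all
    samples have equal weight. `k` is an illustrative scale factor.
    """
    if not boxes:
        raise ValueError("need at least one sampled box")

    total_variance = 0.0
    # Axis 0 uses (x1, x2); axis 1 uses (y1, y2).
    for lo, hi in ((0, 2), (1, 3)):
        centers = [(b[lo] + b[hi]) / 2 for b in boxes]
        extents = [b[hi] - b[lo] for b in boxes]
        # Inter-sample term: variance of the conditional means (box centers).
        inter = pvariance(centers)
        # Intra-sample term: mean of the conditional variances (uniform: w^2 / 12).
        intra = fmean(e * e / 12 for e in extents)
        total_variance += inter + intra

    return k * math.sqrt(total_variance)
```

With this sketch, tightly clustered small boxes yield a small radius, while scattered or large boxes yield a larger one, which matches the intent of sizing the crop per instance rather than using a fixed size.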