Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task because they are not specifically optimized for it. In this paper, we identify a key gap: VLMs exhibit significant latent grounding potential, as shown by their performance on the Pointing Game metric, yet they underperform when asked to output explicit coordinates. To close this gap while avoiding the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By adding explicit spatial cues such as axes, grids, and labeled intersections to the input image, these methods enable VLMs to better articulate their implicit spatial understanding. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. Experimental results show substantial gains from auxiliary reasoning: Mark-Grid Scaffold boosts Gemini-3.1-Pro on ScreenSpot-v2 from 11.72\% under direct inference to 95.20\%, achieves state-of-the-art performance on ScreenSpot, and approaches the strongest fine-tuned methods on ScreenSpot-v2 and UI-I2E-Bench. Our code is available at https://github.com/liweim/AuxiliaryReasoning.
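To make the idea of labeled grid intersections concrete, the following is a minimal sketch under stated assumptions, not the released implementation. The function names (`grid_intersections`, `nearest_label`, `is_hit`), the row-by-row integer labeling, and the fixed pixel `step` are all hypothetical choices made for illustration. The sketch assumes that a model can answer with an intersection label, which is then mapped back to pixel coordinates, and that a prediction counts as correct when the resulting point falls inside the ground-truth bounding box, as is common in GUI grounding evaluation.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]          # (x, y) in pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_intersections(width: int, height: int, step: int) -> Dict[int, Point]:
    """Map integer labels to the pixel coordinates of interior grid intersections.

    Labels are assigned row by row starting at 0 (a hypothetical scheme).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    labels: Dict[int, Point] = {}
    label = 0
    for y in range(step, height, step):
        for x in range(step, width, step):
            labels[label] = (x, y)
            label += 1
    return labels


def nearest_label(labels: Dict[int, Point], target: Point) -> int:
    """Return the label whose intersection is closest to the target point."""
    tx, ty = target
    return min(
        labels,
        key=lambda k: (labels[k][0] - tx) ** 2 + (labels[k][1] - ty) ** 2,
    )


def is_hit(point: Point, box: Box) -> bool:
    """Check whether a predicted point lies inside a ground-truth box."""
    left, top, right, bottom = box
    return left <= point[0] <= right and top <= point[1] <= bottom


# Usage: a 400x300 screenshot with a 100-pixel grid has 3 x 2 interior intersections.
labels = grid_intersections(400, 300, 100)
predicted = labels[nearest_label(labels, (290, 190))]
print(predicted, is_hit(predicted, (250, 150, 350, 250)))
```

In this sketch the grid spacing controls a trade-off: a finer grid allows more precise localization but puts more labels on the image for the model to read.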