Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. However, incorrect grounding can trigger costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround uses a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Then, through a calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and offers the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
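The two steps described in the abstract, scoring uncertainty from the spatial spread of sampled predictions and then calibrating an acceptance threshold, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not SafeGround's actual method: the function names `spatial_dispersion` and `calibrate_threshold` are hypothetical, the dispersion score (mean distance from the centroid, normalized by the screen diagonal) is a simple stand-in for the distribution-aware measure, and the calibration uses the plain empirical FDR on a labeled calibration set, without the finite-sample correction that a statistical guarantee would require.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def spatial_dispersion(points: Sequence[Point], width: float, height: float) -> float:
    """Hypothetical uncertainty score: mean distance of sampled click points
    from their centroid, normalized by the screen diagonal."""
    if not points:
        raise ValueError("need at least one sampled point")
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    diagonal = math.hypot(width, height)
    total = sum(math.hypot(x - cx, y - cy) for x, y in points)
    return total / (len(points) * diagonal)


def calibrate_threshold(
    scores: Sequence[float], correct: Sequence[bool], alpha: float
) -> float:
    """Return the largest threshold t such that, among calibration items with
    uncertainty <= t, the empirical fraction of incorrect predictions is at
    most alpha. Returns -inf if no threshold qualifies.

    Assumption: this is the uncorrected empirical FDR, not a bound that
    holds with a stated probability on unseen data."""
    pairs = sorted(zip(scores, correct))
    best = float("-inf")
    errors = 0
    i = 0
    while i < len(pairs):
        # Accept all items tied at the same score together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            errors += 0 if pairs[j][1] else 1
            j += 1
        if errors / j <= alpha:
            best = pairs[i][0]
        i = j
    return best


def accept(score: float, threshold: float) -> bool:
    """Test-time rule: act on a prediction only if its uncertainty is low enough."""
    return score <= threshold
```

At test time, a prediction whose score exceeds the calibrated threshold would be abstained on or deferred to a fallback (for example, a stronger model), which is one plausible reading of how system-level accuracy could improve.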