Precise localization of GUI elements is crucial for developing GUI agents. Traditional methods rely on bounding-box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) leaving background regions unprocessed lets attention drift away from the desired area, and (2) modeling the target UI element uniformly fails to distinguish its center from its edges, leading to imprecise clicks. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions, thereby highlighting the intended region. To distinguish center from edge, V2P takes a Fitts' Law-inspired approach, modeling GUI interactions as 2D Gaussian heatmaps whose weights decrease gradually from the center toward the edges, with the variance determined by the target's size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained with V2P achieves 92.4\% and 52.5\% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively (see Fig.~\ref{fig:main_results_charts}). Ablations further confirm each component's contribution, underscoring V2P's generalizability in precise GUI grounding tasks and its potential for real-world deployment in future GUI agents.
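The size-dependent Gaussian weighting described above can be illustrated with a minimal sketch. It is not the V2P implementation: the function name `gaussian_heatmap`, the `sigma_scale` parameter, and the choice of a per-axis standard deviation proportional to the box side are assumptions made only for illustration, since the text states only that the variance is determined by the target's size.

```python
import math


def gaussian_heatmap(width, height, box, sigma_scale=0.25):
    """Return a height x width grid of weights peaking at the center of box.

    box is (x0, y0, x1, y1) in pixel coordinates. sigma_scale is a
    hypothetical hyperparameter: the standard deviation along each axis is
    sigma_scale times the box side, so larger targets get wider peaks.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Guard against degenerate (zero-width or zero-height) boxes.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)

    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = (x - cx) / sx
            dy = (y - cy) / sy
            # Unnormalized 2D Gaussian: 1.0 at the center, decaying outward.
            row.append(math.exp(-0.5 * (dx * dx + dy * dy)))
        heatmap.append(row)
    return heatmap


# A 4x4 target inside a 9x9 screen; the weight falls off toward the edges.
hm = gaussian_heatmap(9, 9, (2, 2, 6, 6))
```

In this sketch the weight equals 1 at the element's center and decays smoothly toward its edges, while pixels far from the box receive weights close to zero. A map of this kind could serve as a soft supervision target in place of a uniform mask over the element.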