Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for one-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the GUI grounding performance of both general and fine-tuned models by up to 61%. We evaluate our method on a comprehensive benchmark spanning multiple UI platforms and provide code to reproduce our results.
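The abstract does not spell out the mechanism, but "iterative narrowing" suggests a loop that repeatedly crops toward the model's prediction. The sketch below illustrates that idea under our own assumptions: `predict_point` is a hypothetical stand-in for a VLM grounding call that returns a point relative to the current crop, and the shrink factor and iteration count are illustrative, not the paper's actual settings.

```python
# Hedged sketch of an iterative-narrowing loop for GUI grounding.
# Each iteration: the model predicts a point inside the current crop;
# we shrink the crop and recenter it on that point, so later predictions
# operate on a zoomed-in view. Coordinates are mapped back to the full image.

def iterative_narrowing(predict_point, width, height, iterations=3, shrink=0.5):
    # Current crop in full-image coordinates: (left, top, w, h).
    left, top, w, h = 0.0, 0.0, float(width), float(height)
    px, py = width / 2.0, height / 2.0  # fallback if iterations == 0
    for _ in range(iterations):
        # Hypothetical model call: returns (rx, ry) in [0, 1] x [0, 1],
        # relative to the crop it was shown.
        rx, ry = predict_point(left, top, w, h)
        # Map the relative prediction to absolute image coordinates.
        px, py = left + rx * w, top + ry * h
        # Shrink the crop and recenter on the prediction, clamped to bounds.
        w, h = w * shrink, h * shrink
        left = min(max(px - w / 2, 0.0), width - w)
        top = min(max(py - h / 2, 0.0), height - h)
    return px, py

# Toy predictor that always "sees" a target at (700, 300) on a 1000x500 screen.
TARGET = (700.0, 300.0)

def toy_predictor(left, top, w, h):
    return ((TARGET[0] - left) / w, (TARGET[1] - top) / h)

print(iterative_narrowing(toy_predictor, 1000, 500))
```

With a perfect predictor the loop locks onto the target immediately; the practical benefit claimed by such schemes is that an imperfect model gets progressively easier, higher-resolution views of the neighborhood of its own guess.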