Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we test our method on a comprehensive benchmark spanning various UI platforms and provide code to reproduce our results.
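To make the iterative narrowing idea concrete, the following is a minimal sketch, not the paper's exact implementation: the model predicts a target location, the image region is cropped around that prediction, and the model is queried again on the narrowed region. The `query_model` function, the `crop_factor` value, and the region representation are all illustrative assumptions.

```python
def iterative_narrowing(image_size, query_model, iterations=3, crop_factor=0.5):
    """Refine a predicted target point by repeatedly cropping around it.

    image_size:  (width, height) of the full screenshot.
    query_model: stand-in for a VLM grounding call (assumption); given a
                 region (left, top, w, h), it returns an (x, y) prediction
                 normalized to [0, 1] within that region.
    Returns the final prediction in absolute pixel coordinates.
    """
    # Current search region starts as the full image.
    left, top, w, h = 0.0, 0.0, float(image_size[0]), float(image_size[1])
    px, py = w / 2, h / 2  # fallback if iterations == 0
    for _ in range(iterations):
        # Query the model on the current region.
        nx, ny = query_model((left, top, w, h))
        # Convert the normalized prediction to absolute coordinates.
        px, py = left + nx * w, top + ny * h
        # Narrow: shrink the region and re-center it on the prediction,
        # clamping so it stays inside the image bounds.
        w, h = w * crop_factor, h * crop_factor
        left = min(max(px - w / 2, 0.0), image_size[0] - w)
        top = min(max(py - h / 2, 0.0), image_size[1] - h)
    return px, py
```

With a stub model that always points at a fixed on-screen target, the loop converges to that target while each query operates on a progressively smaller, higher-resolution crop, which is the intuition behind narrowing.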