We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the vision-language model Florence-2-Base. The agent's primary task is to identify the screen coordinates of the UI element corresponding to a user's command. It demonstrates strong performance on ScreenSpot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency. The main improvements come from multi-task training and MLLM-based data augmentation. Manually annotated corpora are scarce, but we show that MLLM-based augmentation can produce better results. On ScreenSpot and OmniAct, our model outperforms both GUI-specific models (e.g., SeeClick) and general-purpose MLLMs (e.g., GPT-4V).