The limited availability of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents, especially in desktop and computer-use scenarios. To address this, we propose AutoCaptioner, an automated GUI data generation pipeline that produces data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created DeskVision, a novel large-scale desktop GUI dataset, along with DeskVision-Eval, the largest desktop test benchmark, which reflects daily usage and covers diverse operating systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding and grounding visual elements without requiring complex architectural designs. We further validate the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and we will open-source both for the community.