Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to multi-step UI navigation and planning.