Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Recent advances in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into two sub-images based on its original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). The two sub-images are encoded separately before being passed to the LLM. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI not only outperforms most open-source UI MLLMs but also surpasses GPT-4V on all the elementary UI tasks.
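
The aspect-ratio-based division described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_screen`, the even split at the midpoint, and the treatment of square screens as portrait are all assumptions made for illustration. "Horizontal division" is read here as a cut along a horizontal line, which yields a top half and a bottom half.

```python
from typing import List, Tuple

# Crop box as (left, upper, right, lower), in pixels.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return crop boxes for the two sub-images of a screen.

    Hypothetical helper: the even midpoint split and the handling of
    square screens (treated as portrait) are assumptions, not details
    taken from the abstract.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    if height >= width:
        # Portrait: cut along a horizontal line -> top and bottom halves.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]

    # Landscape: cut along a vertical line -> left and right halves.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    print(split_screen(1170, 2532))  # portrait screen
    print(split_screen(2532, 1170))  # landscape screen
```

Each box uses the `(left, upper, right, lower)` convention, so it could be passed to a cropping routine such as Pillow's `Image.crop` to obtain the sub-images that are then encoded separately.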
