Context-Aware Workflow Decomposition for Automated Mobile UI Annotation Using Multimodal Large Language Models

Accurate mobile user interface annotation is important for UI understanding, accessibility tools, automated testing, dataset construction, and GUI agents. However, mobile screens are difficult to annotate because they often contain small, dense, nested, and visually ambiguous elements. Multimodal large language models can help automate this process, but their outputs are sensitive to prompt design and the organization of annotation tasks. This paper studies automated mobile UI annotation from a workflow design perspective, focusing on improving annotation precision. Rather than asking the model to annotate all UI elements in a single step, the task is divided into smaller context-aware stages, allowing related UI elements to be handled with clearer instructions and useful screen context. The proposed pipeline uses structured prompts, schema-constrained JSON outputs, and element-specific annotation instructions. Experiments are conducted on expert-annotated mobile UI screens from the MUIAnno dataset, using eight common UI element types: button, tab, clickable text, card, label, plain text, icon, and image. Four workflow strategies are evaluated: one-step, two-step, four-step, and eight-step annotation. Results show that the two-step workflow achieves the highest precision, while deeper decomposition improves recall but produces more false positives. Additional grouping experiments show that annotation quality depends on both workflow depth and element-class grouping. Overall, careful workflow design can make LLM-based mobile UI annotation more reliable for UI understanding, dataset construction, and GUI agent development.
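The staged workflow described above can be illustrated with a minimal sketch. It is not the paper's implementation: the contiguous grouping of element types, the prompt wording, the `bbox` field, and the helper names (`make_stages`, `build_prompt`, `parse_stage_output`, `annotate`, `call_model`) are all assumptions made for illustration. The abstract only states that the task is split into one, two, four, or eight stages, that earlier screen context is reused, and that outputs are schema-constrained JSON.

```python
import json

# The eight element types named in the abstract.
ELEMENT_TYPES = ["button", "tab", "clickable text", "card",
                 "label", "plain text", "icon", "image"]


def make_stages(n_stages):
    """Split the element types into n_stages equal, contiguous groups.

    Assumption: the actual grouping used in the paper is not given in the
    abstract; contiguous slicing is only a placeholder.
    """
    if n_stages < 1 or len(ELEMENT_TYPES) % n_stages != 0:
        raise ValueError("n_stages must divide the number of element types")
    size = len(ELEMENT_TYPES) // n_stages
    return [ELEMENT_TYPES[i:i + size]
            for i in range(0, len(ELEMENT_TYPES), size)]


def build_prompt(stage_types, previous):
    """Build a structured prompt for one stage (hypothetical wording)."""
    return (
        "Annotate only these UI element types: " + ", ".join(stage_types) + ".\n"
        "Elements already annotated on this screen: " + json.dumps(previous) + "\n"
        'Return a JSON list of objects: {"type": <str>, "bbox": [x1, y1, x2, y2]}.'
    )


def parse_stage_output(raw, stage_types):
    """Keep only items that match the assumed schema for this stage."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list")
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox")
        if (item.get("type") in stage_types
                and isinstance(bbox, list) and len(bbox) == 4
                and all(isinstance(v, (int, float)) for v in bbox)):
            valid.append({"type": item["type"], "bbox": bbox})
    return valid


def annotate(screen_id, n_stages, call_model):
    """Run the stages in order, passing earlier annotations as context.

    call_model(screen_id, prompt) is a caller-supplied function that returns
    the model's raw JSON string; no real model API is assumed here.
    """
    annotations = []
    for stage_types in make_stages(n_stages):
        prompt = build_prompt(stage_types, annotations)
        raw = call_model(screen_id, prompt)
        annotations.extend(parse_stage_output(raw, stage_types))
    return annotations
```

In this sketch, `annotate(screen_id, 1, call_model)` corresponds to the one-step strategy and `annotate(screen_id, 2, call_model)` to the two-step strategy. Filtering each response against the stage's own element types is one simple way to discard out-of-scope predictions before they accumulate as false positives.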
