GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.
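
To make the definition of operational knowledge above concrete, the following minimal sketch shows one way a step could be represented and rendered as a short sentence. The `OperationStep` schema, its field names, and the sentence template are illustrative assumptions, not the format used by Teach VLM or its benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperationStep:
    """One step of operational knowledge (hypothetical schema, for illustration only)."""
    order: int                  # position in the execution order
    action_type: str            # e.g. "tap", "type", "swipe"
    target: str                 # description of the target UI element
    text: Optional[str] = None  # textual argument, if the action has one


def render_step(step: OperationStep) -> str:
    """Render a single step as a short natural-language sentence."""
    sentence = f"Step {step.order}: {step.action_type} on {step.target}"
    if step.text is not None:
        sentence += f', entering "{step.text}"'
    return sentence + "."


def render_trajectory(steps: List[OperationStep]) -> List[str]:
    """Sort steps by execution order and render each one."""
    ordered = sorted(steps, key=lambda s: s.order)
    return [render_step(s) for s in ordered]


if __name__ == "__main__":
    demo = [
        OperationStep(order=2, action_type="type", target="the search box", text="coffee"),
        OperationStep(order=1, action_type="tap", target="the search icon"),
    ]
    for line in render_trajectory(demo):
        print(line)
```

Keeping the four components as separate fields before rendering would let an evaluation compare action type, target, argument, and order independently, which is the kind of fine-grained scoring the abstract alludes to.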
