Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework comprises four key components: (1) a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; (2) a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural-language captions; (3) a planner that reads the parsed demonstrations, maintains task states, and dynamically formulates the next high-level action plan through contextual reasoning; and (4) an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that learn effectively simply by observing humans.
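To make the four-stage composition concrete, the sketch below shows one way the recorder output, learner captions, planner state, and executor actions could fit together. It is a minimal illustration, not the ShowUI-Aloha implementation: all class names, fields, and methods (`InteractionEvent`, `Learner.caption`, `Planner.next_plan`, `Executor.execute`, etc.) are hypothetical placeholders chosen for this example.

```python
"""Illustrative sketch of a recorder -> learner -> planner -> executor pipeline.
All names here are hypothetical and only show how the stages might compose."""

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InteractionEvent:
    """One raw user interaction as the recorder might log it."""
    timestamp: float                   # seconds since recording start
    kind: str                          # e.g. "click", "keystroke", "scroll"
    position: Optional[Tuple[int, int]]  # (x, y) screen coordinates, if any
    payload: str = ""                  # typed text, scroll delta, etc.
    frame_path: str = ""               # path to the screen frame at this moment


@dataclass
class CaptionedStep:
    """A recorded event after the learner describes it in natural language."""
    event: InteractionEvent
    caption: str                       # e.g. "Click the 'Save' button"


@dataclass
class ActionPlan:
    """A high-level action the planner proposes for the executor."""
    description: str
    done: bool = False


class Learner:
    def caption(self, event: InteractionEvent) -> CaptionedStep:
        # A real learner would query a vision-language model with the frame
        # and raw event; here we simply echo the event as a placeholder.
        return CaptionedStep(event, f"user performed a {event.kind} at {event.position}")


class Planner:
    def __init__(self) -> None:
        self.history: List[CaptionedStep] = []  # task state kept across steps

    def next_plan(self, step: CaptionedStep) -> ActionPlan:
        self.history.append(step)
        # Contextual reasoning over self.history would happen here.
        return ActionPlan(description=f"replicate: {step.caption}")


class Executor:
    def execute(self, plan: ActionPlan) -> bool:
        # A real executor would issue OS-level clicks/keystrokes via an
        # automation API, with safety checks and real-time feedback.
        print(f"[executor] {plan.description}")
        return True


def replay(events: List[InteractionEvent]) -> None:
    learner, planner, executor = Learner(), Planner(), Executor()
    for event in events:
        step = learner.caption(event)       # learner: raw event -> caption
        plan = planner.next_plan(step)      # planner: caption -> action plan
        executor.execute(plan)              # executor: plan -> OS-level action


if __name__ == "__main__":
    demo = [
        InteractionEvent(0.0, "click", (412, 88)),
        InteractionEvent(1.7, "keystroke", None, payload="report.pdf"),
    ]
    replay(demo)
```

Running the script prints one executor line per recorded event, mirroring how parsed demonstrations would be replayed step by step under these assumptions.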