Emerging large language and multimodal models are driving the evolution of mobile agents, especially for mobile UI task automation. However, existing evaluation approaches, which rely on human validation or on established datasets that compare agent-predicted actions with predefined action sequences, are neither scalable nor faithful. To overcome these limitations, this paper presents LlamaTouch, a testbed for on-device mobile UI task execution and faithful, scalable task evaluation. Observing that task execution proceeds only through UI state transitions, LlamaTouch employs a novel evaluation approach that assesses only whether an agent traverses all manually annotated, essential application/system states. LlamaTouch comprises three key techniques: (1) on-device task execution, which enables mobile agents to interact with realistic mobile environments; (2) fine-grained UI component annotation, which merges pixel-level screenshots and textual screen hierarchies to explicitly identify and precisely annotate essential UI components with a rich set of designed annotation primitives; and (3) a multi-level application state matching algorithm, which uses exact and fuzzy matching to accurately detect critical information on each screen despite unpredictable UI layout/content dynamics. LlamaTouch currently incorporates four mobile agents and 496 tasks, encompassing both tasks from widely used datasets and self-constructed tasks that cover more diverse mobile applications. Evaluation results demonstrate LlamaTouch's high evaluation faithfulness in real-world mobile environments and its better scalability than human validation. LlamaTouch also enables easy task annotation and integration of new mobile agents. Code and dataset are publicly available at https://github.com/LlamaTouch/LlamaTouch.
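To illustrate the idea behind the multi-level state matching described above, the following is a minimal sketch (not the paper's actual implementation): an annotated essential state is first compared against the observed screen text exactly, and only on failure does the matcher fall back to a fuzzy similarity level, so that dynamic content (e.g., a changing message count) does not break evaluation. The function name, threshold, and string-based state representation are illustrative assumptions.

```python
from difflib import SequenceMatcher

def match_state(annotated: str, observed: str, fuzzy_threshold: float = 0.8) -> bool:
    """Hypothetical two-level matcher: exact match first, fuzzy fallback second.

    `annotated` is the manually annotated essential state text;
    `observed` is the text extracted from the current screen.
    """
    # Level 1: exact matching catches stable UI content verbatim.
    if annotated == observed:
        return True
    # Level 2: fuzzy matching tolerates unpredictable content dynamics.
    similarity = SequenceMatcher(None, annotated, observed).ratio()
    return similarity >= fuzzy_threshold

# Dynamic content still matches under the fuzzy level:
# match_state("Inbox - 3 new messages", "Inbox - 5 new messages") -> True
```

A real matcher would operate over structured screen hierarchies and annotated UI components rather than flat strings, but the exact-then-fuzzy cascade is the core of the multi-level design.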