Emerging large language and multimodal models are driving the evolution of mobile agents, especially for mobile UI task automation. However, existing evaluation approaches, which rely on human validation or on established datasets that compare agent-predicted actions with predefined action sequences, are neither scalable nor faithful. To overcome these limitations, this paper presents LlamaTouch, a testbed for on-device mobile UI task execution and faithful, scalable task evaluation. Observing that task execution proceeds only through UI state transitions, LlamaTouch employs a novel evaluation approach that assesses only whether an agent traverses all manually annotated, essential application/system states. LlamaTouch comprises three key techniques: (1) on-device task execution, which enables mobile agents to interact with realistic mobile environments; (2) fine-grained UI component annotation, which merges pixel-level screenshots and textual screen hierarchies to explicitly identify and precisely annotate essential UI components with a rich set of designed annotation primitives; and (3) a multi-level application state matching algorithm, which uses exact and fuzzy matching to accurately detect critical information on each screen despite unpredictable UI layout/content dynamics. LlamaTouch currently incorporates four mobile agents and 496 tasks, encompassing both tasks from widely used datasets and self-constructed tasks that cover more diverse mobile applications. Evaluation results demonstrate LlamaTouch's high evaluation faithfulness in real-world mobile environments and its better scalability than human validation. LlamaTouch also enables easy task annotation and integration of new mobile agents. Code and dataset are publicly available at https://github.com/LlamaTouch/LlamaTouch.
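To illustrate the idea behind the multi-level state matching described above, the following is a minimal sketch (not the paper's actual implementation): an annotated essential state is first compared against the observed screen text exactly, and only on failure does the matcher fall back to a fuzzy similarity level, so that dynamic content (e.g., a changing message count) does not break evaluation. The function name, threshold, and string-based state representation are illustrative assumptions.

```python
from difflib import SequenceMatcher

def match_state(annotated: str, observed: str, fuzzy_threshold: float = 0.8) -> bool:
    """Hypothetical two-level matcher: exact match first, fuzzy fallback second.

    `annotated` is the manually annotated essential state text;
    `observed` is the text extracted from the current screen.
    """
    # Level 1: exact matching catches stable UI content verbatim.
    if annotated == observed:
        return True
    # Level 2: fuzzy matching tolerates unpredictable content dynamics.
    similarity = SequenceMatcher(None, annotated, observed).ratio()
    return similarity >= fuzzy_threshold

# Dynamic content still matches under the fuzzy level:
# match_state("Inbox - 3 new messages", "Inbox - 5 new messages") -> True
```

A real matcher would operate over structured screen hierarchies and annotated UI components rather than flat strings, but the exact-then-fuzzy cascade is the core of the multi-level design.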