AI agents have become increasingly prevalent in recent years, driven by significant advances in large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluation and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated, business-level, LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, along with a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at \url{https://yuxiangchai.github.io/Android-Agent-Arena/}.