Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) a plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) a novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges such as mobile user interface interpretation, action grounding, memory retention, and execution costs. We propose future research directions to address these challenges, moving closer to real-world smartphone agent applications. SPA-Bench is available at https://ai-agents-2030.github.io/SPA-Bench/.