Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

VLM-based mobile agents are increasingly popular because they can interact with smartphone GUIs and XML-structured text to complete daily tasks. However, existing online benchmarks struggle to obtain stable reward signals because of dynamic environmental changes. Offline benchmarks evaluate agents on single-path trajectories, which conflicts with the inherently multi-solution nature of GUI tasks. Additionally, neither type of benchmark assesses whether mobile agents can handle noise or engage in proactive interaction, because their evaluations lack noisy apps and rely on overly complete instructions. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split with offline multi-path evaluation, which assesses the agent's ability to obtain step rewards during task execution. It also contains a noisy split based on apps with pop-ups and ads, and a contaminated split named AITZ-Noise, which together simulate a realistic noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We evaluate these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, and other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.
