Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on agents' instruction-following ability while neglecting their reasoning and exploration abilities. Moreover, these benchmarks do not account for the random noise present in real-world mobile environments, leaving a gap between benchmarks and real-world conditions. To address these limitations, we propose MobileBench-OL, an online benchmark comprising 1080 tasks across 80 Chinese apps. It measures agents' task execution, complex reasoning, and noise robustness through five subsets, each defining multiple evaluation dimensions. We also provide an automatic evaluation framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL reveals significant room for improvement before they can meet real-world requirements. Human evaluation further confirms that MobileBench-OL reliably measures the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.