AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng

From arXiv; 21 pages, 7 figures

Benchmarks are essential for gauging progress in Mobile GUI Agents. In practice, users often fail to state precise instructions with full task details at the outset, and what they do say is typically ambiguous. Agents must therefore converge on the user's true intent through active clarification and interaction during execution. However, existing benchmarks predominantly assume that user-issued instructions are complete and unambiguous. This paradigm assesses only single-turn execution and overlooks the agent's ability to align with user intent. To address this limitation, we introduce AmbiBench, the first benchmark to incorporate a taxonomy of instruction clarity, shifting evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, the taxonomy defines four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a dataset of 240 ecologically valid tasks across 25 applications, each subject to strict review protocols. Furthermore, to support evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework built on an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of state-of-the-art (SoTA) agents across the clarity levels, quantify the gains from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards and lays the foundation for next-generation agents capable of truly understanding user intent.
