With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpret user instructions, and fail on challenging tasks, and little prior work has examined why and where they fail. To address this gap, we introduce DailyDroid, a benchmark of 75 tasks covering five scenarios across 25 Android apps and spanning three difficulty levels to mimic everyday smartphone use. We run the benchmark on GPT-4o and o4-mini with text-only and multimodal (text + screenshot) inputs, for 300 trials in total. The two input modes perform comparably, with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM and app design, with implications for future mobile agents, applications, and UI development.