AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at github.com/google-research/android_world.
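
To make the task structure described in the abstract concrete, the following is a minimal sketch of a parameterized task with initialization, success-checking, and tear-down steps. It is not the AndroidWorld API: the class names (`FakeDevice`, `AddContactTask`), the method names, and the in-memory contact store are all hypothetical stand-ins for a real Android device and its system state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical stand-in for device system state (not a real emulator)."""
    contacts: dict = field(default_factory=dict)


class AddContactTask:
    """Hypothetical parameterized task: add a contact with a name and number."""

    # Natural-language goal template; parameters are filled in per instance.
    template = "Create a new contact for {name} with the phone number {number}."

    def __init__(self, params):
        self.params = params

    @classmethod
    def generate_random_params(cls, rng):
        # The same seed always yields the same task instance.
        name = rng.choice(["Ana Silva", "Ben Okoro", "Chen Wei"])
        number = "555" + "".join(str(rng.randint(0, 9)) for _ in range(7))
        return {"name": name, "number": number}

    @property
    def goal(self):
        return self.template.format(**self.params)

    def initialize(self, device):
        # Start from a known state: the target contact must not exist yet.
        device.contacts.pop(self.params["name"], None)

    def is_successful(self, device):
        # The reward is computed by inspecting state, not the agent's actions.
        stored = device.contacts.get(self.params["name"])
        return 1.0 if stored == self.params["number"] else 0.0

    def tear_down(self, device):
        # Remove whatever the episode created so later tasks are unaffected.
        device.contacts.pop(self.params["name"], None)


def run_episode(task, device, agent):
    """Run one episode: initialize, let the agent act, score, then clean up."""
    task.initialize(device)
    try:
        agent(task.goal, device)
        return task.is_successful(device)
    finally:
        task.tear_down(device)
```

Under these assumptions, sampling parameters from `random.Random(seed)` gives a different task instance per seed while keeping each instance reproducible. Because the reward is derived from the final device state, any sequence of actions that leaves the correct contact in place counts as success.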
