Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. Yet progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functioning Android environment that provides reward signals for 116 programmatic task workflows across 20 real-world Android applications. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate AndroidWorld's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android and find it to be less effective on mobile, suggesting that future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks. The analysis demonstrates that variations in task parameters can significantly alter the complexity of a task, and therefore an agent's performance, which highlights the importance of testing agents under diverse conditions. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android_world.
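The idea of parameterized tasks whose reward is read from system state can be illustrated with a small, self-contained sketch. This is not the AndroidWorld API: the `ContactTask` class, its `sample`, `goal`, and `reward` methods, and the dictionary standing in for device state are all hypothetical names chosen for illustration, and a real environment would inspect an actual Android device rather than a Python dict.

```python
import random
from dataclasses import dataclass


@dataclass
class ContactTask:
    """Hypothetical parameterized task; not part of the AndroidWorld API."""

    name: str
    phone: str

    @classmethod
    def sample(cls, rng: random.Random) -> "ContactTask":
        # Draw fresh parameters so each instance is a different task variation.
        name = rng.choice(["Ana Silva", "Bo Chen", "Cy Okoro"])
        phone = "555" + "".join(str(rng.randint(0, 9)) for _ in range(7))
        return cls(name=name, phone=phone)

    def goal(self) -> str:
        # Natural-language instruction rendered from the sampled parameters.
        return f"Create a new contact for {self.name} with number {self.phone}."

    def reward(self, system_state: dict) -> float:
        # Success is judged from the (simulated) device state, not from the
        # agent's action trace, so the same check works for any parameters.
        contacts = system_state.get("contacts", {})
        return 1.0 if contacts.get(self.name) == self.phone else 0.0


rng = random.Random(0)
task = ContactTask.sample(rng)
print(task.goal())

# Simulated outcomes of an agent run: stand-ins for querying a real device.
state_after_success = {"contacts": {task.name: task.phone}}
state_after_failure = {"contacts": {}}
print(task.reward(state_after_success), task.reward(state_after_failure))
```

Because the check depends only on the final state and the sampled parameters, the same reward function covers every variation of the task, which is the property the abstract describes as durability across task variations.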