Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping) via APIs, but also generate rich code with complex control flow, iteratively, based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks requiring a simple sequence of API calls. To remedy this gap, we built $\textbf{AppWorld Engine}$, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created $\textbf{AppWorld Benchmark}$ (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.