Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction than collecting trajectories from real-world environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Because the environments are fully executable and their database states are accessible, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.