The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step toward general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risk. Model-based reinforcement learning (MBRL) offers a promising solution: learn a world model of the environment and let the agent interact with it in simulation. This paper introduces DynaWeb, a novel MBRL framework that trains web agents by having them interact with a web world model trained to predict naturalistic web page representations given agent actions. The model serves as a synthetic web environment in which an agent policy can "dream," generating vast quantities of rollout trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from the training data, randomly interleaving them with on-policy rollouts during training to improve stability and sample efficiency. Experiments on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering an efficient and scalable path to online agentic RL.
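The training loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the Dyna-style data flow only: the actual DynaWeb world model and policy are learned LLMs over web page representations, whereas here both are stand-in stubs, and all function names (`world_model`, `policy`, `imagine_rollout`, `build_batch`) are illustrative, not from the paper.

```python
import random

def world_model(page, action):
    """Stub: predict the next web page representation given the current
    page and an agent action (a learned model in the real system)."""
    return f"{page}->{action}"

def policy(page):
    """Stub agent policy: choose an action for the current page."""
    return random.choice(["click", "type", "scroll"])

def imagine_rollout(start_page, horizon=4):
    """Roll the policy inside the world model ('dreaming') to produce
    one imagined trajectory, never touching the live web."""
    page, traj = start_page, []
    for _ in range(horizon):
        action = policy(page)
        next_page = world_model(page, action)
        traj.append((page, action, next_page))
        page = next_page
    return traj

def build_batch(expert_trajs, start_pages, mix_prob=0.5):
    """Randomly interleave real expert trajectories with on-policy
    imagined rollouts, as the abstract describes, for one RL update."""
    batch = []
    for page in start_pages:
        batch.append(("imagined", imagine_rollout(page)))
        if expert_trajs and random.random() < mix_prob:
            batch.append(("expert", random.choice(expert_trajs)))
    random.shuffle(batch)
    return batch

# One hypothetical expert trajectory and one mixed training batch.
expert = [[("home", "click", "home->click")]]
batch = build_batch(expert, ["home", "search"])
```

The policy-update step itself (e.g. PPO or GRPO on the mixed batch) is omitted; the sketch only shows how imagined and expert data are combined.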