Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents, which employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by (1) proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; and (2) training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena) while being 4-5 times more efficient, and it also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments.
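The planning paradigm described above can be sketched as a simple simulate-score-commit loop. This is a minimal, hypothetical illustration, not the paper's implementation: `world_model` and `value_function` stand in for LLM calls, and the toy stand-ins below exist only to make the loop runnable.

```python
# Minimal sketch of model-based planning for a web agent: a world model
# predicts the outcome of each candidate action, a value function scores the
# predicted states, and the agent commits to the highest-scoring action.
# No real action is executed during deliberation, so irreversible actions
# are never triggered while the agent is "dreaming".
from typing import Callable, List, Tuple

def plan_one_step(
    state: str,
    candidate_actions: List[str],
    world_model: Callable[[str, str], str],   # (state, action) -> predicted next state
    value_function: Callable[[str], float],   # predicted state -> task-progress score
) -> Tuple[str, str, float]:
    """Simulate every candidate action in imagination, then commit to the best."""
    best = None
    for action in candidate_actions:
        predicted = world_model(state, action)  # imagined outcome only
        score = value_function(predicted)
        if best is None or score > best[2]:
            best = (action, predicted, score)
    return best

# Toy stand-ins for the LLM world model and value function (hypothetical).
def toy_world_model(state: str, action: str) -> str:
    return f"{state} -> after {action}"

def toy_value(predicted_state: str) -> float:
    return 1.0 if "checkout" in predicted_state else 0.1

action, predicted, score = plan_one_step(
    "cart page",
    ["click 'continue shopping'", "click 'checkout'"],
    toy_world_model,
    toy_value,
)
print(action)  # the action whose imagined outcome scores highest
```

In practice, both calls would be served by (possibly the same) LLM prompted to describe the next page state and to judge progress toward the user's goal; only the single committed action is ever executed on the live website.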