Progress in autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's ability in grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and a GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to real-world planning tasks on Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.