Progress in autonomous web navigation has been hindered by two factors: a dependence on billions of exploratory interactions through online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model outperforms the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, 38 times more than prior work, and make them available to promote future research in this direction.
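To make the action interface concrete, the following is a minimal sketch of how a text action decoded by such an agent might be turned into a structured command. It is not the WebGUM implementation: the JSON output format, the `ref` and `text` field names, and the `WebAction` and `parse_action` helpers are all assumptions introduced for illustration. The abstract itself names only click and type as example actions.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary: the abstract names only "click" and "type"
# as examples, so this sketch restricts itself to those two.
ALLOWED_ACTIONS = {"click", "type"}


@dataclass(frozen=True)
class WebAction:
    """Structured form of one navigation action (illustrative only)."""
    kind: str                   # "click" or "type"
    ref: str                    # assumed identifier of the target HTML element
    text: Optional[str] = None  # text to enter, used only by "type"


def parse_action(decoded: str) -> WebAction:
    """Parse one decoded model output into a WebAction.

    Assumes, for illustration, that the model emits a JSON object such as
    {"action": "click", "ref": "5"}.
    """
    payload = json.loads(decoded)
    kind = payload["action"]
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {kind!r}")
    text = payload.get("text")
    if kind == "type" and text is None:
        raise ValueError("a 'type' action needs a 'text' field")
    # Normalize the element reference to a string so callers see one type.
    return WebAction(kind=kind, ref=str(payload["ref"]), text=text)
```

Under these assumptions, `parse_action('{"action": "type", "ref": "12", "text": "hello"}')` yields a `WebAction` whose `kind` is `"type"`. Validating the decoded string before execution keeps malformed model outputs from reaching the browser.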