Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Progress in autonomous web navigation has been hindered by a dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model outperforms the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, a dataset 38 times larger than that of prior work, and make them available to promote future research in this direction.
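
As a rough illustration of the interface described above (the agent receives an instruction, a screenshot, and HTML, and emits an action such as click or type), the following is a minimal sketch. The class names, the element-reference field, and the `click <ref>` / `type <ref> <text>` text format are all assumptions made for illustration; the abstract does not specify WebGUM's actual input or output encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What the agent sees at one step (field names are hypothetical)."""
    instruction: str   # natural-language task description
    html: str          # HTML of the current page
    screenshot: bytes  # encoded image of the rendered page


@dataclass(frozen=True)
class Action:
    """A web navigation action (representation is hypothetical)."""
    kind: str                   # "click" or "type"
    target: str                 # assumed reference to a page element
    text: Optional[str] = None  # text to enter, only for "type"


def parse_action(decoded: str) -> Action:
    """Parse an assumed text format: 'click <ref>' or 'type <ref> <text>'."""
    parts = decoded.strip().split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"malformed action: {decoded!r}")
    kind, target = parts[0].lower(), parts[1]
    if kind == "click" and len(parts) == 2:
        return Action("click", target)
    if kind == "type" and len(parts) == 3:
        return Action("type", target, parts[2])
    raise ValueError(f"unsupported or malformed action: {decoded!r}")
```

Under these assumptions, a decoded model output such as `type 7 hello world` would be parsed into a `type` action on element `7` carrying the text `hello world`, while anything outside the two action kinds is rejected rather than guessed at.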

