Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We build WebAgent from two components: Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks: it achieves an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark and state-of-the-art performance on Mind2Web, an offline task planning evaluation.
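The plan, summarize, and act decomposition described above can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not the WebAgent implementation: every name in it (`run_episode`, `Step`, the `toy_*` stand-ins) is hypothetical, the models are replaced by trivial string heuristics, and the stopping rule (the planner returning `None`) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One iteration of the loop (hypothetical record type)."""
    sub_instruction: str
    snippet: str
    program: str


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str], str],
    generate_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Plan a sub-instruction, summarize the page, then act via a program."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()
        # Planning: pick the next sub-instruction (assumed: None means done).
        sub = plan(instruction, history, html)
        if sub is None:
            break
        # Summarization: keep only the task-relevant part of the page.
        snippet = summarize(sub, html)
        # Acting: produce a program from the sub-instruction and snippet.
        program = generate_program(sub, snippet)
        execute(program)
        history.append(Step(sub, snippet, program))
    return history


# Toy stand-ins for the models (assumptions for illustration only).
def toy_plan(instruction: str, history: List[Step], html: str) -> Optional[str]:
    parts = [p.strip() for p in instruction.split(" then ")]
    return parts[len(history)] if len(history) < len(parts) else None


def toy_summarize(sub: str, html: str) -> str:
    words = set(sub.lower().split())
    kept = []
    for line in html.splitlines():
        tokens = set(line.lower().replace("<", " ").replace(">", " ").split())
        if words & tokens:
            kept.append(line)
    return "\n".join(kept)


def toy_generate(sub: str, snippet: str) -> str:
    # Returns program text only; a real agent would emit browser actions.
    return f"# {sub}\n# grounded on: {snippet!r}"
```

In this sketch the planner and summarizer are separate callables for readability; in the setup described above, both roles are assigned to HTML-T5, while program generation is assigned to Flan-U-PaLM. Passing `log.append` as `execute` records the generated programs instead of running them.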