Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM for grounded code generation and with HTML-T5, new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves success on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks: it achieves an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark and state-of-the-art (SoTA) performance on Mind2Web, an offline task planning evaluation.
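The plan, summarize, and act loop described above can be illustrated with a minimal sketch. It is not the WebAgent implementation: the names `run_episode`, `Step`, `plan`, `summarize`, `synthesize`, `execute`, and `get_html` are hypothetical placeholders, and the model calls (HTML-T5 for planning and summarization, Flan-U-PaLM for program synthesis) are assumed to be supplied by the caller as plain Python callables. Treating planning and summarization as two separate calls, and stopping when the planner returns `None`, are also assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One executed step: what was planned, what was read, what was run."""
    sub_instruction: str
    snippet: str
    program: str


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Run a modular plan -> summarize -> act loop (illustrative only).

    All callables are hypothetical stand-ins for the models and the
    browser environment; none of them is a real library API.
    """
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()  # current raw HTML of the page
        # Planning: decompose the instruction into the next sub-instruction.
        sub_instruction = plan(instruction, history, html)
        if sub_instruction is None:  # assumed convention: planner signals completion
            break
        # Summarization: keep only the task-relevant part of the long HTML.
        snippet = summarize(sub_instruction, html)
        # Grounded code generation: produce a Python program for this step.
        program = synthesize(sub_instruction, snippet)
        # Acting: run the generated program against the website.
        execute(program)
        history.append(Step(sub_instruction, snippet, program))
    return history
```

Passing the models in as callables keeps the sketch independent of any particular model API and makes it possible to exercise the control flow with simple stubs before real models are plugged in.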