Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM for grounded code generation, and with HTML-T5, new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks: it achieves an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.
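The plan, summarize, and act loop described above can be sketched as follows. This is only an illustrative outline under stated assumptions, not the WebAgent implementation: every name here (`run_episode`, `Step`, the `toy_*` functions) is hypothetical, the language models are replaced by trivial rule-based stand-ins, the stopping signal (`None` from the planner) is an assumption, and generated programs are recorded rather than executed against a real browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    sub_instruction: str  # output of the planning module
    snippet: str          # task-relevant HTML kept by the summarizer
    program: str          # Python source produced by the code generator


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str], str],
    generate_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Run one episode: plan a sub-instruction, summarize HTML, then act."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()
        sub = plan(instruction, history, html)
        if sub is None:  # assumed convention: the planner signals completion
            break
        snippet = summarize(sub, html)
        program = generate_program(sub, snippet)
        execute(program)
        history.append(Step(sub, snippet, program))
    return history


# Toy stand-ins for the models and the browser (all hypothetical).
PAGE = '<input id="q">\n<button id="go">Search</button>\n<footer>About</footer>'
SUBS = ["type 'shoes' into q", "click go"]
executed: List[str] = []


def toy_plan(instruction: str, history: List[Step], html: str) -> Optional[str]:
    # Replays a fixed decomposition; a real planner would condition on all inputs.
    return SUBS[len(history)] if len(history) < len(SUBS) else None


def toy_summarize(sub: str, html: str) -> str:
    # Keeps only the lines whose id matches the last word of the sub-instruction.
    target = sub.split()[-1]
    return "\n".join(line for line in html.splitlines() if f'id="{target}"' in line)


def toy_generate(sub: str, snippet: str) -> str:
    # Emits a placeholder program as text; no browser calls are made.
    return f"# {sub}\n# grounded on: {snippet}"


steps = run_episode(
    "search for shoes", lambda: PAGE, toy_plan, toy_summarize, toy_generate, executed.append
)
```

Passing the three modules in as callables mirrors the modular split in the text: the planner and summarizer could be swapped for one model and the program generator for another without changing the loop.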