Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our recipe improves the success rate on a real website by over 50%, and that HTML-T5 is the best model for solving HTML-based tasks, achieving a 14.9% higher success rate than the prior SoTA on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.
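The plan, summarize, and act loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `run_episode`, `Step`, and `demo`, the callable signatures, and the toy stubs are hypothetical and do not come from the WebAgent implementation. In particular, the stubs stand in for the real models (HTML-T5 for planning and summarization, Flan-U-PaLM for code generation), and the generated "program" is only recorded, not executed against a browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One iteration of the loop: what was planned, seen, and generated."""
    sub_instruction: str
    snippet: str
    program: str


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str, str], str],
    generate_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Hypothetical control loop: plan -> summarize -> generate code -> act."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()
        # Planner stand-in: pick the next sub-instruction, or None when done.
        sub = plan(instruction, history, html)
        if sub is None:
            break
        # Summarizer stand-in: reduce the long page to a task-relevant snippet.
        snippet = summarize(instruction, sub, html)
        # Code-generator stand-in: turn sub-instruction + snippet into a program.
        program = generate_program(sub, snippet)
        execute(program)
        history.append(Step(sub, snippet, program))
    return history


def demo() -> List[Step]:
    """Toy run with hand-written stubs in place of the language models."""
    html = '<form><input id="q"><button id="go">Search</button></form>'
    pending = ["type 'shoes' into the search box", "click the search button"]
    executed: List[str] = []

    def plan(instruction, history, page):
        # Fixed plan: one canned sub-instruction per completed step.
        return pending[len(history)] if len(history) < len(pending) else None

    def summarize(instruction, sub, page):
        # Return the single tag that the sub-instruction refers to.
        target = 'id="q"' if "type" in sub else 'id="go"'
        start = page.index(target)
        return page[page.rfind("<", 0, start): page.index(">", start) + 1]

    def generate_program(sub, snippet):
        # A real system would emit browser-automation code; this is a placeholder.
        return f"# {sub}\nact_on({snippet!r})"

    return run_episode(
        "search for shoes", lambda: html, plan, summarize,
        generate_program, executed.append,
    )
```

Calling `demo()` returns a list of two `Step` records, one per sub-instruction. Keeping the planner, summarizer, and code generator as separate callables mirrors the modular design the abstract describes, although how the actual system couples these stages is not specified here.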