We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or cover only a limited set of websites and tasks, and are thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, together with crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites is often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still substantial room for improvement towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.
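The filter-then-prompt idea described above can be illustrated with a minimal sketch. This is not the released implementation: the `Element` type, the function names, the prompt wording, and the `top_k` value are all hypothetical, and the word-overlap scorer is only a toy stand-in for the small LM that ranks candidate elements.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """A candidate DOM element (hypothetical minimal representation)."""
    node_id: int
    html: str


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(task: str, element: Element) -> float:
    """Toy stand-in for the small LM: fraction of task words present in the element."""
    task_words = tokenize(task)
    if not task_words:
        return 0.0
    return len(task_words & tokenize(element.html)) / len(task_words)


def filter_candidates(
    task: str,
    elements: List[Element],
    scorer: Callable[[str, Element], float] = overlap_score,
    top_k: int = 2,
) -> List[Element]:
    """Stage 1: keep only the top_k elements ranked by the scorer."""
    ranked = sorted(elements, key=lambda e: scorer(task, e), reverse=True)
    return ranked[:top_k]


def build_prompt(task: str, candidates: List[Element]) -> str:
    """Stage 2: build a short prompt for the LLM from the filtered elements only."""
    lines = [f"Task: {task}", "Candidate elements:"]
    lines += [f"[{e.node_id}] {e.html}" for e in candidates]
    lines.append("Choose one element id and an action.")
    return "\n".join(lines)


if __name__ == "__main__":
    page = [
        Element(0, "<a>Home</a>"),
        Element(1, '<input placeholder="Search flights">'),
        Element(2, "<button>Book flights</button>"),
        Element(3, "<footer>Privacy policy</footer>"),
    ]
    task = "search flights to Tokyo"
    print(build_prompt(task, filter_candidates(task, page)))
```

The point of the sketch is that the LLM sees only a handful of ranked elements rather than the full page, which keeps the prompt short; in practice the scorer would be a trained ranking model rather than word overlap.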