We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or cover only a limited set of websites and tasks, and are thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, along with crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites is often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still substantial room for improvement towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.
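The filter-then-prompt idea mentioned in the abstract can be illustrated with a minimal two-stage sketch. This is not the released Mind2Web implementation: the function names, the token-overlap scorer (a stand-in for the small LM), the top-k cutoff, and the sample page elements are all assumptions chosen only to keep the example runnable.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of length >= 3 (drops 'a', 'id', 'to', ...)."""
    return set(re.findall(r"[a-z0-9]{3,}", text.lower()))


def overlap_score(task: str, element_html: str) -> float:
    """Toy relevance score: fraction of task tokens found in the element.

    Assumption: a real system would score each (task, element) pair with a
    small fine-tuned LM; token overlap is used here purely as a placeholder.
    """
    task_tokens = tokenize(task)
    if not task_tokens:
        return 0.0
    return len(task_tokens & tokenize(element_html)) / len(task_tokens)


def filter_candidates(
    task: str,
    elements: List[str],
    k: int = 2,
    scorer: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Stage 1: keep the k highest-scoring elements, preserving page order."""
    ranked = sorted(
        range(len(elements)),
        key=lambda i: scorer(task, elements[i]),
        reverse=True,
    )
    return [elements[i] for i in sorted(ranked[:k])]


def build_prompt(task: str, candidates: List[str]) -> str:
    """Stage 2 input: a compact prompt that lists only the filtered elements."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    return (
        f"Task: {task}\n"
        f"Candidate elements:\n{options}\n"
        "Which element should be acted on next?"
    )


# Hypothetical page elements and task, for illustration only.
elements = [
    '<a href="/about">About us</a>',
    '<input id="search" placeholder="Search flight">',
    '<button id="submit">Find flight</button>',
    '<footer>Copyright 2023</footer>',
]
task = "Search for a flight to Columbus"

kept = filter_candidates(task, elements, k=2)
prompt = build_prompt(task, kept)  # this shorter prompt is what the LLM would see
```

The point of the design is that the expensive model only ever reads the few elements that survive stage 1, so the prompt length depends on `k` rather than on the size of the page.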