WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

From arXiv. Project page with code, data, and demos: https://webshop-pnlp.github.io. v3 is the NeurIPS camera-ready version; v4 fixes the choice oracle result as per https://github.com/princeton-nlp/WebShop/issues/15

Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision-making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.
