WebCanvas: Benchmarking Web Agents in Online Environments

From arXiv. Our platform, tool, and dataset are publicly available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/

For web agents to be practically useful, they must adapt to a continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks capture only the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) a novel evaluation metric that reliably captures the critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original static Mind2Web dataset, containing 542 tasks with 2439 intermediate evaluation states; (3) lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain a high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
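The two headline numbers, task success rate and task completion rate, can be told apart with a small sketch. It is not the WebCanvas implementation. It assumes that each task has a fixed set of intermediate evaluation states (called "key nodes" here), that a task's completion score is the fraction of key nodes reached, and that a task succeeds only when every key node is reached. The abstract does not spell out these definitions or the aggregation used (per-task averaging here), so they are assumptions, and the `TaskResult` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one evaluated task (not a WebCanvas type)."""
    task_id: str
    key_nodes_total: int    # number of intermediate evaluation states for the task
    key_nodes_reached: int  # how many of them the agent satisfied


def task_completion_rate(results: Sequence[TaskResult]) -> float:
    """Mean per-task fraction of key nodes reached (assumed aggregation)."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.key_nodes_reached / r.key_nodes_total for r in results) / len(results)


def task_success_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks in which every key node was reached (assumed definition)."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.key_nodes_reached == r.key_nodes_total for r in results) / len(results)


if __name__ == "__main__":
    # Made-up illustrative data, not taken from Mind2Web-Live.
    demo = [
        TaskResult("task-a", key_nodes_total=4, key_nodes_reached=4),
        TaskResult("task-b", key_nodes_total=5, key_nodes_reached=2),
        TaskResult("task-c", key_nodes_total=2, key_nodes_reached=1),
    ]
    print(f"completion rate: {task_completion_rate(demo):.3f}")
    print(f"success rate:    {task_success_rate(demo):.3f}")
```

Under these assumptions the completion rate is always at least as high as the success rate, because partially completed tasks still earn credit. That is consistent with the reported gap between 48.8% and 23.1%.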
