For web agents to be practically useful, they must adapt to a continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks capture only the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) a novel evaluation metric that reliably captures the critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original static Mind2Web dataset containing 542 tasks with 2439 intermediate evaluation states; and (3) lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain a high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluation. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
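The gap between the two reported rates is easier to read with a small sketch of how such metrics could be computed from intermediate evaluation states. The definitions below are assumptions made for illustration, not the authors' implementation: the task completion rate is taken as the share of annotated intermediate states reached, pooled over all tasks, and the task success rate as the share of tasks in which every intermediate state was reached. The `TaskResult` type, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure for illustration)."""
    task_id: str
    key_nodes_total: int    # annotated intermediate evaluation states
    key_nodes_reached: int  # states the agent actually satisfied


def completion_rate(results):
    """Share of all intermediate states reached, pooled over tasks (assumed definition)."""
    total = sum(r.key_nodes_total for r in results)
    reached = sum(r.key_nodes_reached for r in results)
    return reached / total if total else 0.0


def success_rate(results):
    """Share of tasks in which every intermediate state was reached (assumed definition)."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.key_nodes_reached == r.key_nodes_total)
    return solved / len(results)


# Made-up results for three hypothetical tasks, not data from the paper.
results = [
    TaskResult("task-a", key_nodes_total=4, key_nodes_reached=4),
    TaskResult("task-b", key_nodes_total=5, key_nodes_reached=2),
    TaskResult("task-c", key_nodes_total=3, key_nodes_reached=0),
]

print(f"completion rate: {completion_rate(results):.3f}")
print(f"success rate:    {success_rate(results):.3f}")
```

Under these assumed definitions, partial progress counts toward the completion rate but not toward the success rate, so the former can never be lower than the fraction of fully solved tasks weighted by their states. That is consistent with a completion rate well above the success rate, as in the figures reported above.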