The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.