The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.