The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and in-depth analysis of agent behaviors, which could result in more adaptable, capable agents and ultimately accelerate innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.