The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making reliable comparisons and reproducible results difficult to achieve. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers the flexibility to integrate new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.