Advances in large language models (LLMs) have opened a new era of autonomous applications in the real world, driving innovation in advanced web-based agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, a web agent powered by a Large Multimodal Model (LMM) that completes user instructions end-to-end by interacting with real-world websites. We also propose a new evaluation protocol for web agents that addresses the difficulty of automatically evaluating open-ended web agent tasks by leveraging the multimodal comprehension capabilities of GPT-4V. To evaluate our agents, we create a new benchmark of real-world tasks gathered from 15 widely used websites. WebVoyager achieves a 55.7% task success rate, significantly surpassing both GPT-4 (All Tools) and the WebVoyager (text-only) setup, which underscores its capability in practical applications. Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in real-world settings.