Recent progress on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents great potential for web agents: it can successfully complete 50% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out not to be effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML text and visuals. Yet there is still a substantial gap with oracle grounding, leaving ample room for further improvement.