Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focused on end-to-end web agent tasks, unable to measure fine-grained abilities such as OCR, understanding, and grounding. In this paper, we introduce \bench{}, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. \bench{} consists of seven tasks and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, the Claude-3 series, and GPT-4V(ision) on \bench{}, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe \bench{} will serve as a valuable resource for the research community and contribute to the development of more powerful and versatile MLLMs for web-related applications.