BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
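To make the distinction between final-answer accuracy and subgoal-driven process evaluation concrete, the following is a minimal sketch. The abstract does not specify the exact scoring formula, so the record layout (`TaskResult`), the function names, and the choice of averaging per-question subgoal completion are all assumptions made for illustration, not the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark question (hypothetical record layout)."""
    answer_correct: bool       # did the final answer match the reference?
    subgoals_met: list[bool]   # one flag per expert-validated subgoal


def final_accuracy(results: list[TaskResult]) -> float:
    """Fraction of questions whose final answer is correct."""
    return sum(r.answer_correct for r in results) / len(results)


def process_score(results: list[TaskResult]) -> float:
    """Mean per-question fraction of subgoals reached (assumed metric)."""
    per_task = [sum(r.subgoals_met) / len(r.subgoals_met) for r in results]
    return sum(per_task) / len(per_task)


# Toy data, not taken from the paper.
results = [
    TaskResult(True, [True, True, True]),
    TaskResult(False, [True, True, False, False]),
    TaskResult(False, [False, False]),
]

print(f"final accuracy: {final_accuracy(results):.3f}")
print(f"process score:  {process_score(results):.3f}")
```

In this toy data the second question fails on the final answer yet reaches half of its subgoals, which is the kind of partial progress that a process-level score can expose and a final-answer metric cannot.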
