Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created through a careful multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow and show that it effectively improves model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
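
The abstract names a multi-round cropped-search workflow but does not spell out its steps. The following is only a minimal illustrative sketch of what such a loop could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the grid-based crop schedule (`grid_boxes`), the injected `search_fn` callable standing in for an image-search backend, the `min_score` stopping threshold, and the toy scoring function `toy_search`.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Hit = Tuple[Box, str, float]     # (crop box, returned label, match score)


def grid_boxes(width: int, height: int, n: int) -> List[Box]:
    """Split a width x height image into an n x n grid of crop boxes."""
    boxes = []
    for row in range(n):
        for col in range(n):
            boxes.append((
                col * width // n,
                row * height // n,
                (col + 1) * width // n,
                (row + 1) * height // n,
            ))
    return boxes


def multi_round_cropped_search(
    width: int,
    height: int,
    search_fn: Callable[[Box], Tuple[str, float]],
    max_rounds: int = 3,
    min_score: float = 0.8,
) -> Optional[Hit]:
    """Search with progressively finer crops (hypothetical schedule).

    Round 1 queries the full image, round 2 a 2x2 grid, round 3 a 3x3
    grid, and so on. The loop stops after the first round in which the
    best score so far reaches min_score.
    """
    best: Optional[Hit] = None
    for round_idx in range(1, max_rounds + 1):
        for box in grid_boxes(width, height, round_idx):
            label, score = search_fn(box)  # assumed backend call
            if best is None or score > best[2]:
                best = (box, label, score)
        if best is not None and best[2] >= min_score:
            break
    return best


# Toy stand-in for a search backend: the score is the fraction of the
# crop covered by a known target region. Purely illustrative.
TARGET: Box = (50, 50, 100, 100)


def toy_search(box: Box) -> Tuple[str, float]:
    left, top = max(box[0], TARGET[0]), max(box[1], TARGET[1])
    right, bottom = min(box[2], TARGET[2]), min(box[3], TARGET[3])
    overlap = max(0, right - left) * max(0, bottom - top)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return "landmark", overlap / area


if __name__ == "__main__":
    print(multi_round_cropped_search(100, 100, toy_search))
```

In this toy setup the full-image query scores only 0.25 because the target fills a quarter of the frame, while the second round's lower-right crop matches the target exactly, so the loop stops there. This mirrors the abstract's point that full-image matching is an idealized case and that cropping to the relevant entity can matter in realistic retrieval.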
