Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work augments MLLMs with a ``reasoning-then-tool-call'' paradigm over visual and textual search engines, yielding substantial gains on tasks that require extensive factual information. These approaches, however, typically define multimodal search in a naive setting: they assume that a single full-image or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on these observations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly query real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, and internalizes deep-research capabilities into the MLLM via cold-start supervision and reinforcement learning (RL), yielding a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
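To make the paradigm concrete, below is a minimal Python sketch of one way a multi-turn, multi-entity, multi-scale search loop could be organized. It assumes a generic tool-calling agent interface; every name here (`call_mllm`, `image_search`, `text_search`, `crop_region`, `Step`) is a hypothetical placeholder for illustration, not the released Vision-DeepResearch API.

```python
"""Minimal sketch of a multi-turn, multi-entity, multi-scale search loop.
All helpers below are hypothetical placeholders, not the released API."""
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Step:
    thought: str                          # the model's reasoning for this turn
    tool: Optional[str] = None            # "image_search" | "text_search" | None
    query: str = ""                       # text query, when tool == "text_search"
    bbox: Optional[Tuple[int, int, int, int]] = None  # entity crop, if any
    answer: str = ""                      # final answer, when tool is None


def call_mllm(question, image, history) -> Step:
    raise NotImplementedError             # placeholder for the policy MLLM


def image_search(image) -> str:
    raise NotImplementedError             # placeholder visual search engine


def text_search(query: str) -> str:
    raise NotImplementedError             # placeholder textual search engine


def crop_region(image, bbox):
    raise NotImplementedError             # placeholder entity-level cropper


def run_deep_research(question, image, max_steps: int = 50) -> str:
    history = []                          # interleaved thoughts, tool calls, evidence
    for _ in range(max_steps):
        step = call_mllm(question, image, history)
        if step.tool is None:             # no tool call: the model answers directly
            return step.answer
        if step.tool == "image_search":
            # Multi-scale: query with the full image or a cropped entity region.
            query_img = crop_region(image, step.bbox) if step.bbox else image
            evidence = image_search(query_img)
        else:
            evidence = text_search(step.query)
        history.append((step, evidence))  # fold retrieved evidence back into context
    return call_mllm(question, image, history).answer  # forced answer at the budget
```

The loop body is intentionally generic: the multi-entity and multi-scale behavior comes from the model emitting many such steps, each targeting a different entity crop or scale of the input image, rather than from any special-purpose machinery in the loop itself.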