Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and on text-only evidence trajectories, which limits the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, and validate robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.