Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.