Existing benchmarks for multimodal agentic search evaluate search and visual browsing, but they either confine visual evidence to the input or treat it as an answer endpoint rather than as part of an interleaved search trajectory. We introduce \textbf{InterLV-Search}, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence are repeatedly used to condition later search steps. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Unlike existing benchmarks, it also includes multimodal multi-branch samples that require comparing multiple entities during evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search: the best model stays below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench