LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from failures in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories (State Mutation, Causal Inference, Global Summary, and Visual Tracking) with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs, judging answers by three-judge majority voting. GPT-5 achieves the highest accuracy (42.43%), outperforming Gemini 3 Pro (30.97%) and GPT-4o (19.20%), yet it remains below 50%, highlighting the difficulty of multi-hop retrieval planning. When agents are given the gold evidence clips, performance becomes near-perfect, confirming that retrieval planning is the primary bottleneck.
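
The retrieval-necessity condition and the scoring protocol described above can be illustrated with a minimal Python sketch. The abstract does not say how necessity is verified or how scores are aggregated, so everything here is an assumption made for illustration: `solvable` is a hypothetical oracle (for example, a model or annotator check), and `is_strictly_k_hop`, `majority_verdict`, and `summarize` are hypothetical names, not part of the benchmark's released interface.

```python
from typing import Callable, List, Sequence, Tuple


def is_strictly_k_hop(clips: Sequence[str],
                      solvable: Callable[[List[str]], bool]) -> bool:
    """Leave-one-out check: the full clip set must make the question
    solvable, and dropping any single clip must make it unsolvable.
    `solvable` is an assumed oracle supplied by the caller."""
    clips = list(clips)
    if not solvable(clips):
        return False
    return all(not solvable(clips[:i] + clips[i + 1:])
               for i in range(len(clips)))


def majority_verdict(votes: Sequence[bool]) -> bool:
    """Majority decision over an odd number of judge votes."""
    if len(votes) % 2 == 0:
        raise ValueError("majority voting needs an odd number of judges")
    return sum(votes) > len(votes) // 2


def summarize(runs: Sequence[Tuple[Sequence[bool], int]]) -> Tuple[float, float]:
    """Each run is (judge_votes, tool_calls) for one question.
    Returns (accuracy in percent, mean tool calls per question)."""
    correct = [majority_verdict(votes) for votes, _ in runs]
    accuracy = 100.0 * sum(correct) / len(runs)
    mean_calls = sum(calls for _, calls in runs) / len(runs)
    return accuracy, mean_calls


# Toy oracle: the question is answerable only if clips a, b, and c are all present.
needs_abc = lambda cs: {"a", "b", "c"} <= set(cs)

strict = is_strictly_k_hop(["a", "b", "c"], needs_abc)          # every clip is necessary
redundant = is_strictly_k_hop(["a", "b", "c", "d"], needs_abc)  # clip d is removable
accuracy, mean_calls = summarize([([True, True, False], 3),
                                  ([False, False, True], 5)])
```

In this toy setup, a clip set containing a removable clip fails the check, which mirrors the "exactly k necessary clips" requirement; reporting accuracy alongside mean tool calls is one simple way to expose the accuracy-efficiency trade-off.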
