Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
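The chunk-level selection described above can be illustrated with a minimal sketch. It is not the CARVE implementation: the `Candidate` type, the configuration labels, the toy retrievers, and the rule of keeping the highest-scoring configuration per chunk are all assumptions made for illustration. The sketch also assumes the reranker's scores are comparable across configurations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    chunk_id: str   # identifier of a video chunk
    config: str     # hypothetical modality-granularity label, e.g. "frames@30s"
    score: float    # hypothetical reranker score, assumed comparable across configs


# A retriever maps a query to (chunk_id, score) pairs under one configuration.
Retriever = Callable[[str], List[Tuple[str, float]]]


def select_per_chunk(candidates: List[Candidate], top_k: int) -> List[Candidate]:
    """Keep the best-scoring configuration for each chunk, then the top_k chunks."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.chunk_id)
        if current is None or cand.score > current.score:
            best[cand.chunk_id] = cand
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked[:top_k]


def retrieve(query: str, retrievers: Dict[str, Retriever], top_k: int = 3) -> List[Candidate]:
    """Pool results from every configuration's retriever, then select per chunk."""
    pooled: List[Candidate] = []
    for config, retriever in retrievers.items():
        for chunk_id, score in retriever(query):
            pooled.append(Candidate(chunk_id, config, score))
    return select_per_chunk(pooled, top_k)


if __name__ == "__main__":
    # Toy retrievers with made-up scores; real ones would query a video index.
    toy_retrievers: Dict[str, Retriever] = {
        "frames@30s": lambda q: [("c1", 0.9), ("c2", 0.2)],
        "transcript@3min": lambda q: [("c1", 0.4), ("c2", 0.7), ("c3", 0.5)],
    }
    for cand in retrieve("where did I leave my keys?", toy_retrievers, top_k=2):
        print(cand.chunk_id, cand.config, cand.score)
```

In this toy run the two selected chunks carry different configurations, which is the interleaved evidence form the abstract refers to. A query-level method would instead assign one configuration to every chunk.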