Retrieval-augmented generation from videos requires systems to retrieve relevant audiovisual evidence from large corpora and synthesize it into coherent, attributed text. Current approaches struggle at both ends: retrieval methods fail on complex, multi-faceted queries that a single embedding cannot capture, while generation methods lack the high-level reasoning needed to synthesize across multiple videos and face memory constraints over long, multi-video contexts. We present MARQUIS, a three-stage pipeline that addresses these limitations through (1) query expansion, fusion, and reranking, (2) calibrated structured evidence extraction, and (3) article generation from the extracted evidence, optionally controlled by an RLM. On the MAGMaR2026 shared task, we improve retrieval nDCG@10 from 0.195 to 0.759. For article generation, ITER-QA-BASE improves the average human score from 3.09 to 3.83 over the CAG baseline, while MARQUIS-RLM achieves a human score of 3.30 and the strongest citation recall among non-QA systems.
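To make the retrieval stage and its metric concrete, the following is a minimal sketch, not the MARQUIS implementation. It assumes that the ranked lists returned for several expanded queries are fused with reciprocal rank fusion (the abstract does not name the fusion method) and that nDCG@10 uses linear gain (the shared task may use a different gain variant). The function names, the constant `k=60`, the video IDs, and the relevance judgments are all hypothetical.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of video IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1))."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical rankings returned for three expansions of one query.
rankings = [
    ["v1", "v2", "v3"],
    ["v2", "v1", "v4"],
    ["v2", "v5", "v1"],
]
relevance = {"v2": 2, "v1": 1}  # toy graded relevance judgments

fused = reciprocal_rank_fusion(rankings)
score = ndcg_at_k(fused, relevance, k=10)
print(fused, round(score, 3))
```

In this toy case the fused list places both relevant videos first, so nDCG@10 reaches its maximum; a reranking step, as in stage (1), would then reorder the fused candidates with a stronger scoring model.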