Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at https://hf.co/vidore.