Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses in external knowledge relevant to the query. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which no single type of knowledge source can address. To address this, we introduce UniversalRAG, a framework designed to retrieve and integrate knowledge from heterogeneous sources spanning diverse modalities and granularities. Specifically, we observe that forcing all modalities into a unified representation space derived from a single aggregated corpus induces a modality gap, in which retrieval tends to favor items of the same modality as the query. Motivated by this, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and we further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-grained retrieval tailored to the complexity and scope of each query. We validate UniversalRAG on 10 benchmarks spanning multiple modalities, showing its superiority over a range of modality-specific and unified baselines.
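As a concrete illustration of the routing idea, the following is a minimal Python sketch of modality-aware routing under stated assumptions: the corpus names, the `route_query` heuristic, and the embedding sizes are hypothetical placeholders rather than the paper's implementation (the actual router would be a learned classifier and the corpora real text/image/video indexes). Only the overall flow, routing the query to one modality-specific corpus and then retrieving within it, reflects the described method.

```python
# Minimal sketch of modality-aware routing; all names and data are hypothetical.
import numpy as np

# Hypothetical corpora: one embedding index per modality/granularity level.
CORPORA = {
    "text_paragraph": np.random.rand(100, 64),
    "text_document":  np.random.rand(50, 64),
    "image":          np.random.rand(80, 64),
    "video_clip":     np.random.rand(60, 64),
    "video_full":     np.random.rand(30, 64),
}

def route_query(query: str) -> str:
    """Stand-in router: the paper uses a trained router to predict the
    knowledge source best suited to the query; this keyword heuristic is
    purely for illustration."""
    if "show" in query or "look" in query:
        return "image"
    if "scene" in query or "clip" in query:
        return "video_clip"
    return "text_paragraph"

def retrieve(query_emb: np.ndarray, corpus_key: str, k: int = 3) -> np.ndarray:
    """Targeted retrieval within the routed corpus via cosine similarity,
    avoiding the cross-modality comparison that causes the modality gap."""
    corpus = CORPORA[corpus_key]
    sims = corpus @ query_emb / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    return np.argsort(-sims)[:k]  # indices of the top-k items

query = "Which scene shows the bridge collapsing?"
key = route_query(query)                 # -> "video_clip"
query_emb = np.random.rand(64)           # placeholder for a real query encoder
print(key, retrieve(query_emb, key))
```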