Retrieval-Augmented Generation (RAG) has become a core paradigm for document question answering. However, existing methods fall short on multimodal documents. One category relies on layout analysis and text extraction, so it can use only explicit text and struggles to capture images and unstructured content. The other treats document segments as visual input and passes them directly to visual language models (VLMs), ignoring the semantic advantages of text and yielding suboptimal retrieval and generation. To address these gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which leverages text and images simultaneously for more accurate retrieval and generation. The framework has two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores so that cross-modal signals can be fused effectively. To support research in this direction, we further construct and release a large-scale dataset of (query, text, image) triplets. Experiments show that the proposed framework consistently outperforms single-modality-based RAG on multiple visual document question-answering (VDQA) benchmarks. These findings indicate that integrating co-modality information into the RAG framework in a unified manner is an effective way to improve the performance of complex VDQA systems.
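The idea behind fusing text and image similarities after statistical normalization can be illustrated with a minimal sketch. The abstract does not specify the normalization or the fusion rule, so the z-score standardization, the weighted sum, and the names `zscore`, `fuse_scores`, `top_k`, and `alpha` below are all assumptions made for illustration, not the UCMR method itself.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize scores to zero mean and unit variance (assumed normalization)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All candidates score the same, so this modality carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(text_scores, image_scores, alpha=0.5):
    """Combine per-candidate text and image similarities on a common scale.

    alpha is a hypothetical weight on the text modality.
    """
    if len(text_scores) != len(image_scores):
        raise ValueError("both modalities must score the same candidates")
    z_text = zscore(text_scores)
    z_image = zscore(image_scores)
    return [alpha * t + (1 - alpha) * i for t, i in zip(z_text, z_image)]


def top_k(fused, k):
    """Return the indices of the k highest fused scores, best first."""
    return sorted(range(len(fused)), key=lambda idx: fused[idx], reverse=True)[:k]


# Toy scores: text similarities sit in a narrow high band, image similarities
# in a wide low band, so the raw values are not directly comparable.
text_scores = [0.82, 0.80, 0.78]
image_scores = [0.30, 0.10, 0.50]

fused = fuse_scores(text_scores, image_scores)
ranking = top_k(fused, 2)
```

Without normalization, averaging these raw scores would let the wider-ranging image similarities dominate the ranking. After standardization, each modality contributes on the same scale, which is the motivation the abstract gives for normalizing before fusion.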