BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.
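
To make the evidence-fusion idea concrete, the following is a minimal sketch of how Dempster's rule of combination could rerank text-image pairs. It is not the authors' implementation: the two-hypothesis frame, the `reliability` discount, the three evidence sources (`text_sim`, `image_sim`, `consistency`), and all function names are assumptions chosen for illustration, and the sketch omits the Bayesian posterior step that the abstract mentions.

```python
from itertools import product

# Frame of discernment (assumed): a candidate pair is relevant or irrelevant.
R = frozenset({"rel"})
N = frozenset({"irr"})
THETA = R | N  # the whole frame, i.e. "this source cannot tell"


def simple_mass(support, reliability):
    """Turn a score in [0, 1] into a mass function.

    `reliability` (hypothetical parameter) is the share of mass the source
    commits; the remainder is left on THETA as uncertainty.
    """
    return {
        R: reliability * support,
        N: reliability * (1.0 - support),
        THETA: 1.0 - reliability,
    }


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory hypotheses.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


def pair_belief(text_sim, image_sim, consistency, reliability=0.8):
    """Belief that a text-image pair is relevant after fusing three sources."""
    m = simple_mass(text_sim, reliability)
    for score in (image_sim, consistency):
        m = dempster_combine(m, simple_mass(score, reliability))
    return m.get(R, 0.0)


def rerank(pairs):
    """Sort candidate pairs by fused belief, highest first."""
    return sorted(
        pairs,
        key=lambda p: pair_belief(p["text_sim"], p["image_sim"], p["consistency"]),
        reverse=True,
    )


# Made-up scores: A has the best text similarity but little corroboration;
# B is weaker on text alone but its modalities agree with each other.
candidates = [
    {"id": "A", "text_sim": 0.9, "image_sim": 0.3, "consistency": 0.2},
    {"id": "B", "text_sim": 0.7, "image_sim": 0.7, "consistency": 0.9},
]
ranked_ids = [p["id"] for p in rerank(candidates)]
```

With these illustrative scores, ranking by text similarity alone would place A first, whereas the fused belief places B first because its text, image, and consistency evidence corroborate one another.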
