Uncertainty Quantification for Multimodal Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) improves the question answering capabilities of Large Language Models (LLMs) by incorporating external knowledge, and it has recently been extended to multimodal settings through Vision-Language Models (VLMs) that integrate visual and textual information. Despite these advances, generated answers can still be incorrect or misleading. Uncertainty Quantification (UQ) methods aim to estimate the reliability of model outputs, but most existing approaches are designed for text-only models and perform poorly in multimodal RAG scenarios. A key challenge is capturing uncertainty that arises at multiple stages of the pipeline, including retrieval, visual understanding, and generation. In this work, we show that modeling uncertainty with multimodal and retrieval-aware probability signals improves uncertainty estimation in multimodal RAG systems. We introduce LeMUQ, a Learnable Multimodal UQ method that analyzes token probabilities under input modifications, such as removing a modality or the retrieved context. By encoding these signals as probability tokens and processing them with a finetuned model, our approach captures interactions between modalities and retrieval. Experiments across datasets, retrievers, and VLMs show consistent improvements over baseline and finetuned UQ methods: LeMUQ increases AUROC by 3.8% on average. Our method also generalizes well across retrieval setups and datasets, while transfer across different VLMs yields mixed results. Our findings highlight the importance of modeling multimodal uncertainty and provide a step toward more reliable and safer multimodal RAG systems. Code is available on GitHub.
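To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the LeMUQ implementation. It assumes the per-token log-probabilities of one fixed answer have already been rescored under the full input and under ablated inputs; the condition names ("full", "no_image", "no_context") and the function names are hypothetical. LeMUQ itself encodes such signals as probability tokens and passes them to a finetuned model, a step that is not reproduced here. The second function computes AUROC, the evaluation metric, as the probability that a correct answer receives a higher confidence score than an incorrect one.

```python
from typing import Dict, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of an answer."""
    return sum(token_logprobs) / len(token_logprobs)


def ablation_features(logprobs: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Compare the answer's likelihood under the full input with ablated inputs.

    `logprobs` maps a condition name to the per-token log-probabilities of
    the SAME answer tokens, rescored under that condition. The key "full"
    is required; all other condition names are illustrative assumptions.
    """
    full = mean_logprob(logprobs["full"])
    features = {"full": full}
    for name, values in logprobs.items():
        if name != "full":
            # A positive drop means the removed input supported the answer.
            features[f"drop_{name}"] = full - mean_logprob(values)
    return features


def auroc(confidence: Sequence[float], correct: Sequence[int]) -> float:
    """Fraction of (correct, incorrect) answer pairs ranked in the right
    order by confidence; tied pairs count as half."""
    pos = [c for c, y in zip(confidence, correct) if y == 1]
    neg = [c for c, y in zip(confidence, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both correct and incorrect answers")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    example = {
        "full": [-0.1, -0.2, -0.3],
        "no_image": [-1.0, -1.2, -0.8],
        "no_context": [-0.2, -0.2, -0.2],
    }
    print(ablation_features(example))
    print(auroc([0.9, 0.2, 0.5, 0.5], [1, 0, 1, 0]))
```

In this toy case the answer's likelihood falls sharply when the image is removed and stays roughly unchanged when the retrieved context is removed, which suggests the answer leans on the visual input. A learned estimator could consume such contrasts alongside the raw probabilities.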
