Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Retrieval-augmented generation (RAG) relies on retrieved context to guide large language models (LLMs), yet treats retrieval as a weak heuristic rather than as verifiable evidence, which leads to unsupported answers, hallucinations, and reliance on spurious context. We introduce a training framework that treats the RAG pipeline as an interactive proof system by adapting the Merlin-Arthur (M/A) protocol: Arthur (the generator LLM) trains on questions whose context provenance is unknown, Merlin supplies helpful evidence, and Morgana injects adversarial, misleading context. Both Merlin and Morgana use an explainable AI (XAI) method to identify and modify the evidence most influential to Arthur. This trains Arthur to (1) answer when the evidence supports the answer, (2) reject when the evidence is insufficient, and (3) rely on the context spans that truly ground the answer. We further introduce a verification framework that disentangles explanation fidelity from model predictive errors, together with the Explained Information Fraction (EIF), which normalizes M/A mutual-information guarantees. Across three RAG datasets and multiple LLM families and sizes, M/A training makes LLMs more grounded in evidence, improves information-theoretic measures (soundness, completeness), and increases reject behavior while reducing hallucinations, all without manually annotated unanswerable samples. Finally, the retriever also improves in recall and MRR via automatically generated M/A hard positives and negatives. While high accuracy does not guarantee entropy flow from context to answer, our EIF results show that autonomous interactive-proof-style supervision enables RAG systems that treat retrieved documents as verifiable evidence.
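For readers unfamiliar with the retrieval metrics named above, the following is a minimal sketch of how recall@k and mean reciprocal rank (MRR) are conventionally computed. It uses the standard textbook definitions and is not taken from the paper's evaluation code; the function names, document identifiers, and toy rankings are all hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over all queries in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, relevance.get(query, set()))
        for query, ranked in rankings.items()
    )
    return total / len(rankings)


# Hypothetical toy data: two queries, each with a ranked list of document ids.
rankings = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}
relevance = {
    "q1": {"d1"},
    "q2": {"d4", "d5"},
}

if __name__ == "__main__":
    for query, ranked in rankings.items():
        print(query, "recall@3 =", recall_at_k(ranked, relevance[query], 3))
    print("MRR =", mean_reciprocal_rank(rankings, relevance))
```

In this toy example the first relevant document sits at rank 2 for `q1` and rank 3 for `q2`, so the MRR is the mean of 1/2 and 1/3. Under these definitions, a hard positive that moves a relevant document up the ranking raises MRR, and recall@k rises when more of the relevant set enters the top k.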
