Retrieval-augmented generation (RAG) uses retrieved texts to enhance large language models (LLMs). Studies show that while RAG provides valuable external information (benefit), it may also mislead LLMs (detriment) with noisy or incorrect retrieved texts. Although many existing methods attempt to preserve the benefit and avoid the detriment, they lack a theoretical explanation for RAG. The benefit and detriment in the next-token prediction of RAG remain a black box that cannot be quantified or compared in an explainable manner, so existing methods are data-driven and require additional utility evaluators or post-hoc processing. This paper takes the first step toward providing a theory to explain and trade off the benefit and detriment in RAG. First, we model RAG as the fusion between the distribution of the LLM's knowledge and the distribution of the retrieved texts. Then, we formalize the trade-off between the value of external knowledge (benefit) and its potential risk of misleading the LLM (detriment) in the next-token prediction of RAG via the distribution difference in this fusion. Finally, we prove that the actual effect of RAG on a token, i.e., the comparison between benefit and detriment, can be predicted without any training or access to the utility of retrieval. Based on our theory, we propose a practical novel method, Tok-RAG, which achieves collaborative generation between the pure LLM and RAG at the token level to preserve the benefit and avoid the detriment. Experiments on real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the effectiveness of our method and support our theoretical findings.