Retrieval-augmented generation (RAG) has shown impressive capability in providing reliable answer predictions and addressing hallucination problems. A typical RAG implementation uses powerful retrieval models to extract external information and large language models (LLMs) to generate answers. Meanwhile, LLM-based retrieval has recently gained attention for its substantial improvements in information retrieval (IR), owing to the semantic understanding capability of LLMs. However, directly applying an LLM to a RAG system presents challenges. It may cause feature locality problems, as massive parametric knowledge can hinder the effective use of global information across the corpus; for example, an LLM-based retriever often takes document summaries as input instead of full documents. Moreover, the variety of pre-training tasks in LLMs introduces variance, further weakening their performance as retrievers. To address these issues, we propose Invar-RAG, a novel two-stage fine-tuning architecture. In the retrieval stage, an LLM-based retriever is constructed by integrating LoRA-based representation learning to tackle feature locality issues. To further enhance retrieval performance, we develop two patterns (invariant and variant patterns) and an invariance loss that reduces LLM variance. In the generation stage, a refined fine-tuning method is employed to improve the LLM's accuracy in generating answers from retrieved information. Experimental results show that Invar-RAG significantly outperforms existing baselines on three open-domain question answering (ODQA) datasets. Code is available in the Supplementary Material for reproducibility.
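The abstract does not specify the exact form of the invariance loss, but the underlying idea, penalising disagreement between the relevance distributions produced by the invariant and variant patterns, can be sketched minimally. The function names, the use of KL divergence, and the toy score lists below are all illustrative assumptions, not the paper's actual formulation:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of relevance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def invariance_loss(scores_invariant, scores_variant):
    # Hypothetical invariance loss: KL(p_invariant || p_variant).
    # It is zero when the two patterns rank documents identically
    # and grows as the variant pattern's distribution drifts away.
    p = softmax(scores_invariant)
    q = softmax(scores_variant)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical score patterns incur (near-)zero loss ...
print(invariance_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# ... while disagreeing patterns are penalised.
print(invariance_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In training, a term like this would be added to the retrieval objective so that gradient updates push the two patterns toward consistent relevance estimates, which is one plausible way to realise the variance reduction the abstract describes.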