Retrieval-augmented generation (RAG) has shown impressive capability in providing reliable answer predictions and alleviating hallucination. A typical RAG implementation uses powerful retrieval models to extract external information and large language models (LLMs) to generate answers. Meanwhile, LLM-based retrieval has recently gained attention for its substantial improvements in information retrieval (IR), owing to the semantic understanding capability of LLMs. However, directly applying LLMs to RAG systems presents challenges. First, it can cause a feature locality problem: the massive parametric knowledge hinders effective use of global information across the corpus; for example, an LLM-based retriever often takes document summaries as input instead of full documents. Moreover, the variety of pre-training tasks in LLMs introduces variance, which further weakens their performance as retrievers. To address these issues, we propose Invar-RAG, a novel two-stage fine-tuning architecture. In the retrieval stage, we construct an LLM-based retriever with LoRA-based representation learning to tackle the feature locality problem; to further enhance retrieval performance, we design two patterns (an invariant pattern and a variant pattern) together with an invariance loss that reduces LLM variance. In the generation stage, a refined fine-tuning method improves the LLM's accuracy in generating answers from the retrieved information. Experimental results show that Invar-RAG significantly outperforms existing baselines on three open-domain question answering (ODQA) datasets. Code is available in the Supplementary Material for reproducibility.
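The abstract mentions an invariance loss that aligns representations produced under an invariant and a variant pattern, but does not give its formulation. Below is a purely illustrative sketch, assuming the loss is the mean-squared distance between L2-normalized query embeddings obtained from the same encoder under the two prompt patterns; all names, shapes, and the stand-in random embeddings are hypothetical, not the paper's actual method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize embeddings to unit length so the loss is scale-invariant.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def invariance_loss(emb_invariant, emb_variant):
    """Mean-squared distance between two views of the same queries.

    emb_invariant, emb_variant: (batch, dim) embeddings from the same
    encoder under the invariant and variant prompt patterns.
    (Illustrative only; Invar-RAG's actual loss may differ.)
    """
    a = l2_normalize(emb_invariant)
    b = l2_normalize(emb_variant)
    return float(np.mean(np.sum((a - b) ** 2, axis=-1)))

# Stand-in embeddings: random vectors play the role of encoder outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
# A "variant view": the same embeddings with a small perturbation,
# mimicking the representation shift induced by a different prompt pattern.
variant = base + 0.01 * rng.normal(size=(4, 8))

print(invariance_loss(base, base))     # identical views give zero loss
print(invariance_loss(base, variant))  # small positive loss
```

Minimizing such a loss during fine-tuning would pull the two views of each query together, which is one plausible way to realize the variance reduction the abstract describes.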