Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose RankRAG, a novel instruction fine-tuning framework that instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well when only a small fraction of ranking data is added to the training blend, and they outperform existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. In addition, it performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its strong capability for generalization to new domains.
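As a rough illustration of the rank-then-generate flow described above, the following Python sketch reranks retrieved contexts and passes only the highest-scoring ones to the generator. It is a minimal sketch under stated assumptions, not the RankRAG implementation: the function names (`rank_then_generate`, `toy_score`, `toy_generate`), the `keep` parameter, and the word-overlap scorer are all hypothetical stand-ins. In a RankRAG-style system, both the scoring and the generation callables would be backed by the same instruction-tuned LLM; the prompts and scoring details are not specified here.

```python
import re
from typing import Callable, List, Tuple


def rank_then_generate(
    question: str,
    contexts: List[str],
    score_fn: Callable[[str, str], float],
    generate_fn: Callable[[str, List[str]], str],
    keep: int = 2,
) -> Tuple[str, List[str]]:
    """Rerank retrieved contexts, keep the best `keep`, then generate.

    `score_fn` and `generate_fn` are placeholders (assumptions): in a
    single-model setup they would both call the same LLM.
    """
    # Sort contexts by relevance score, highest first (sort is stable).
    ranked = sorted(contexts, key=lambda c: score_fn(question, c), reverse=True)
    top = ranked[:keep]
    return generate_fn(question, top), top


def toy_score(question: str, context: str) -> float:
    """Hypothetical scorer: number of distinct words shared with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", context.lower()))
    return float(len(q_words & c_words))


def toy_generate(question: str, contexts: List[str]) -> str:
    """Hypothetical generator: builds the prompt a real LLM would receive."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Question: {question}\nContexts:\n{joined}"


if __name__ == "__main__":
    retrieved = [
        "Hamlet was written by William Shakespeare.",
        "Paris is the capital of France.",
        "Shakespeare wrote many plays including Hamlet.",
    ]
    prompt, kept = rank_then_generate(
        "Who wrote Hamlet?", retrieved, toy_score, toy_generate, keep=2
    )
    print(prompt)
```

The scorer and generator are passed in as callables so that the toy stand-ins can be swapped for calls to one shared model without changing the control flow.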