Retrieval-Augmented Generation (RAG) is a common method for integrating external knowledge into pretrained Large Language Models (LLMs) to improve accuracy and relevance in question answering (QA) tasks. However, prompt engineering and resource efficiency remain significant bottlenecks in developing optimal and robust RAG solutions for real-world QA applications. Recent studies have shown success in using fine-tuning to address these problems; in particular, Retrieval-Augmented Fine-Tuning (RAFT) applied to smaller 7B models has demonstrated performance superior to RAG setups built on much larger models such as GPT-3.5. Combining RAFT with parameter-efficient fine-tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), promises an even more efficient solution, yet this combination remains unexplored. In this work, we combine RAFT with LoRA to reduce fine-tuning and storage requirements and achieve faster inference while maintaining comparable RAG performance. The result is a more compute-efficient RAFT, or CRAFT, which is particularly useful for knowledge-intensive QA tasks in resource-constrained environments where internet access may be restricted and hardware resources are limited.
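The efficiency gain from LoRA can be made concrete with its standard low-rank update rule (stated here as general background, not as this work's specific configuration). A frozen pretrained weight matrix is augmented with a trainable low-rank correction:

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$; for example, with $d = k = 4096$ and $r = 8$, this is roughly 65K parameters instead of about 16.8M, which is what reduces the fine-tuning and storage cost of applying RAFT to a 7B model.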