Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge

Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs) because it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. Prior work has explored different mechanisms for providing knowledge to LLMs, such as fine-tuning and retrieval-augmented generation (RAG), but their relative effectiveness for multi-hop question answering remains poorly understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly on questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
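As a rough illustration of the non-parametric route compared above, the sketch below shows how a retrieval step might gather two supporting facts for a multi-hop question and place them in the prompt. It is not the pipeline used in the paper: the token-overlap scorer, the function names (`overlap`, `retrieve_multi_hop`, `build_prompt`), and the three-passage toy corpus in the style of QASC are all assumptions made for illustration.

```python
import re
from collections import Counter

# Toy corpus and question, written in the style of QASC (illustrative only).
CORPUS = [
    "Differential heating of air produces wind.",
    "Wind is used for producing electricity.",
    "Plants need sunlight to grow.",
]
QUESTION = "Differential heating of air can be harnessed for what?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(query, passage):
    """Count tokens shared by query and passage (multiset intersection)."""
    return sum((Counter(tokenize(query)) & Counter(tokenize(passage))).values())


def retrieve_multi_hop(question, corpus, hops=2):
    """Pick one passage per hop, expanding the query with each retrieved fact.

    Assumption: a simple lexical scorer stands in for a real retriever.
    """
    query = question
    selected = []
    for _ in range(hops):
        remaining = [p for p in corpus if p not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda p: overlap(query, p))
        selected.append(best)
        # The next hop can match on terms introduced by the retrieved fact.
        query = query + " " + best
    return selected


def build_prompt(question, passages):
    """Format retrieved facts and the question into a single prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"


facts = retrieve_multi_hop(QUESTION, CORPUS)
print(build_prompt(QUESTION, facts))
```

In this toy case the second fact shares almost no words with the question, so it is reached mainly through the term "wind" introduced by the first retrieved fact. That is the compositional step that makes a question multi-hop; a fine-tuned model would instead have to recall and combine both facts from its parameters.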
