Retrieval-augmented generation (RAG) with large language models (LLMs) for Question Answering (QA) entails furnishing relevant context within the prompt to facilitate answer generation by the LLM. During generation, inaccuracies or hallucinations frequently arise from two primary factors: inadequate or distracting context in the prompt, and the inability of LLMs to reason effectively over the facts. In this paper, we investigate whether providing aligned context via a carefully selected passage sequence leads to better answer generation by the LLM for multi-hop QA. We introduce GenSco, a novel approach to selecting passages based on the predicted decomposition of a multi-hop question. The framework consists of two distinct LLMs: (i) a Generator LLM, used for question decomposition and final answer generation; and (ii) an auxiliary open-source LLM, used as a scorer to semantically guide the Generator in passage selection. The Generator is invoked only once for answer generation, resulting in a cost-effective and efficient approach. We evaluate GenSco on three well-established multi-hop question answering datasets: 2WikiMultiHop, Adversarial HotPotQA, and MuSiQue, achieving absolute gains of $15.1$ and $5.9$ points in Exact Match score over the best-performing baselines on MuSiQue and 2WikiMultiHop, respectively.
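The scorer-guided selection described above can be sketched as a greedy loop: for each predicted sub-question of the decomposition, the auxiliary LLM scores every remaining candidate passage and the best one is appended to the context. A minimal illustration follows; the token-overlap `score` function is a stand-in for the actual likelihood-style score an auxiliary LLM would produce, and the function names are hypothetical, not the paper's API.

```python
# Minimal sketch of scorer-guided greedy passage selection for multi-hop QA.
# Assumption: one passage is chosen per decomposed sub-question (one per hop),
# and the scorer is replaced here by simple token overlap for illustration.

def score(sub_question: str, passage: str) -> float:
    """Stand-in scorer: fraction of sub-question tokens present in the passage.
    In a GenSco-style setup this would come from the auxiliary open-source LLM."""
    q_tokens = set(sub_question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

def select_passages(sub_questions: list[str], passages: list[str]) -> list[str]:
    """Greedily pick the highest-scoring remaining passage for each sub-question,
    building an ordered passage sequence aligned with the decomposition."""
    remaining = list(passages)
    selected = []
    for sq in sub_questions:
        best = max(remaining, key=lambda p: score(sq, p))
        selected.append(best)
        remaining.remove(best)  # each passage is used at most once
    return selected

if __name__ == "__main__":
    subs = ["who directed the film Inception",
            "when was the director of Inception born"]
    docs = ["Inception is a 2010 film directed by Christopher Nolan.",
            "Christopher Nolan was born on 30 July 1970 in London.",
            "The film grossed over 800 million dollars worldwide."]
    # The selected sequence, in hop order, would then be placed in the prompt
    # for a single Generator call that produces the final answer.
    print(select_passages(subs, docs))
```

Because the Generator is called only once, all per-passage scoring cost falls on the cheaper auxiliary model, which is what makes the approach efficient.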