Open-domain question answering (QA) usually requires retrieving relevant information from a large corpus to generate accurate answers. We propose Generator-Retriever-Generator (GRG), a novel approach that combines document retrieval with large language models (LLMs). An LLM is first prompted to generate contextual documents for a given question. In parallel, a dual-encoder network retrieves documents relevant to the question from an external corpus. The generated and retrieved documents are then passed to a second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), improving on them by at least +5.2, +4.2, and +1.6 on the TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at https://github.com/abdoelsayed2016/GRG.
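The three-stage data flow described above can be illustrated with a minimal sketch. It is not the released implementation: `grg_answer`, `toy_generate`, `toy_retrieve`, `toy_read`, and `CORPUS` are hypothetical names, the toy functions stand in for the document-generating LLM, the dual-encoder retriever, and the answer-generating LLM, and the two document sources are called sequentially rather than in parallel.

```python
from typing import Callable, List


def grg_answer(
    question: str,
    generate_docs: Callable[[str, int], List[str]],
    retrieve_docs: Callable[[str, int], List[str]],
    read: Callable[[str, List[str]], str],
    n_generated: int = 2,
    n_retrieved: int = 2,
) -> str:
    """Combine generated and retrieved documents, then produce an answer."""
    # Stage 1: an LLM writes contextual documents for the question.
    generated = generate_docs(question, n_generated)
    # Stage 2: a retriever fetches relevant documents from an external corpus.
    retrieved = retrieve_docs(question, n_retrieved)
    # Stage 3: a second LLM reads both document sets and answers.
    return read(question, generated + retrieved)


# Toy stand-ins (assumptions for illustration only, not real models).
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Nile is a river in Africa.",
]


def toy_generate(question: str, k: int) -> List[str]:
    # Placeholder for prompting an LLM to write k background passages.
    return [f"Background passage {i + 1} for: {question}" for i in range(k)]


def toy_retrieve(question: str, k: int) -> List[str]:
    # Placeholder for dual-encoder retrieval: rank documents by word overlap.
    q_words = set(question.lower().rstrip("?").split())

    def overlap(doc: str) -> int:
        return len(q_words & set(doc.lower().rstrip(".").split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def toy_read(question: str, docs: List[str]) -> str:
    # Placeholder for the second LLM; it only reports how many documents it saw.
    return f"answer drawn from {len(docs)} documents"


answer = grg_answer(
    "What is the capital of France?", toy_generate, toy_retrieve, toy_read
)
```

Passing the three stages as callables keeps the sketch independent of any particular model or index; a real system would replace each toy function with the corresponding model call.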