Open-domain question answering (QA) requires extracting relevant information from large corpora to formulate accurate answers. This paper introduces Generator-Retriever-Generator (GRG), an approach that combines document retrieval with large language models (LLMs). First, an LLM generates contextual documents in response to a given question. In parallel, a dual-encoder network retrieves documents relevant to the question from an external corpus. The generated and retrieved documents are then passed to a second LLM, which produces the final answer. By combining document retrieval with LLM-based generation, our approach addresses the challenges of open-domain QA, in particular that of producing informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), improving over them by at least +5.2, +4.2, and +1.6 on the TriviaQA, NQ, and WebQ datasets, respectively. We release our code, datasets, and checkpoints.\footnote{\url{https://github.com/abdoelsayed2016/GRG}}
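The three-stage flow described in the abstract (generate documents, retrieve documents, read both) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the released code: the names `embed`, `cosine`, `retrieve`, and `grg_answer` are hypothetical, the bag-of-words similarity is only a toy stand-in for a learned dual-encoder, and the generator and reader are passed in as plain callables rather than real LLMs.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'encoder' (hypothetical stand-in for a learned dense encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: List[str], k: int) -> List[str]:
    """Encode the question and each document separately, then keep the top-k by similarity."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def grg_answer(
    question: str,
    generator: Callable[[str], List[str]],   # stage 1: LLM that writes documents (assumed interface)
    reader: Callable[[str, List[str]], str],  # stage 3: second LLM that reads them (assumed interface)
    corpus: List[str],
    k: int = 2,
) -> str:
    """Generate documents, retrieve documents, then let the reader answer from both sets."""
    generated = generator(question)
    retrieved = retrieve(question, corpus, k)
    return reader(question, generated + retrieved)
```

In this sketch the two document sources are simply concatenated before being handed to the reader; how the actual system orders, filters, or weights generated versus retrieved documents is not specified in the abstract.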