The advent of large language models (LLMs) has showcased their efficacy across various domains, yet they often hallucinate, especially in knowledge-intensive tasks that require external knowledge sources. To improve the factual accuracy of language models, retrieval-augmented generation (RAG) has emerged as a popular solution. However, traditional retrieval modules often rely on large-scale document indexes, which can be disconnected from generative tasks. Through the generative retrieval (GR) paradigm, language models can achieve superior retrieval performance by directly generating relevant document identifiers (DocIDs). However, the relationship between GR and downstream tasks, as well as the potential of LLMs in GR, remains underexplored. In this paper, we present a unified language model that leverages an external corpus to handle various knowledge-intensive tasks by seamlessly integrating generative retrieval, closed-book generation, and RAG. To achieve effective retrieval and generation within a unified, continuous decoding process, we introduce the following mechanisms: (1) a ranking-oriented DocID decoding strategy, which improves ranking ability by learning directly from a DocID ranking list; (2) a continuous generation strategy that enables effective and efficient RAG; and (3) well-designed auxiliary DocID understanding tasks that enhance the model's comprehension of DocIDs and their relevance to downstream tasks. We evaluate our approach on the widely used KILT benchmark with two backbone variants: an encoder-decoder T5 model and a decoder-only LLM, Llama2. Experimental results demonstrate the superior performance of our models on both retrieval and downstream knowledge-intensive tasks.