CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks

Large language models (LLMs) have gained significant attention in various fields but are prone to hallucination, especially in knowledge-intensive (KI) tasks. To address this, retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy. However, traditional retrieval modules often rely on a large document index and are disconnected from generative tasks. With the advent of generative retrieval (GR), language models can retrieve by directly generating document identifiers (DocIDs), offering superior performance in retrieval tasks. However, the potential relationship between GR and downstream tasks remains unexplored. In this paper, we propose \textbf{CorpusLM}, a unified language model that leverages an external corpus to tackle various knowledge-intensive tasks by integrating generative retrieval, closed-book generation, and RAG through a unified greedy decoding process. We design the following mechanisms to facilitate effective retrieval and generation and to improve the end-to-end effectiveness of KI tasks: (1) We develop a ranking-oriented DocID list generation strategy, which refines GR by learning directly from a DocID ranking list, to improve retrieval quality. (2) We design a continuous DocIDs-References-Answer generation strategy, which facilitates effective and efficient RAG. (3) We employ well-designed unsupervised DocID understanding tasks to help the model comprehend DocID semantics and their relevance to downstream tasks. We evaluate our approach on the widely used KILT benchmark with two variants of backbone models, i.e., T5 and Llama2. Experimental results demonstrate the superior performance of our models in both retrieval and downstream tasks.
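To make the idea of retrieving by greedily generating a ranked DocID list concrete, the following is a minimal toy sketch, not the CorpusLM implementation. It assumes DocIDs are short token sequences (e.g., titles), that decoding is restricted to valid DocIDs with a prefix trie, and that an already generated DocID is removed so it cannot be emitted again. The names `build_trie`, `remove_docid`, `greedy_docid_list`, and `toy_scorer` are hypothetical, and the scorer is a word-overlap stand-in for a language model's next-token scores.

```python
from typing import Callable, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-DocID marker (an assumption, not from the paper)


def build_trie(docids: Sequence[Sequence[str]]) -> dict:
    """Build a prefix trie over tokenized DocIDs."""
    root: dict = {}
    for tokens in docids:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def remove_docid(trie: dict, tokens: Sequence[str]) -> None:
    """Delete one DocID from the trie so it cannot be generated twice."""
    keys = list(tokens) + [END]
    path = [trie]
    for tok in keys:
        path.append(path[-1][tok])
    # Walk back from the leaf, pruning branches that became empty.
    for depth in range(len(keys) - 1, -1, -1):
        parent = path[depth]
        del parent[keys[depth]]
        if parent:  # other DocIDs still share this prefix
            break


def greedy_docid_list(
    score: Callable[[List[str], str], float],
    docids: Sequence[Sequence[str]],
    k: int,
) -> List[Tuple[str, ...]]:
    """Greedily decode up to k distinct DocIDs, best first."""
    trie = build_trie(docids)
    ranked: List[Tuple[str, ...]] = []
    while trie and len(ranked) < k:
        node, prefix = trie, []
        while True:
            # Greedy step: pick the best-scoring token the trie allows.
            tok = max(sorted(node), key=lambda t: score(prefix, t))
            if tok == END:
                break
            prefix.append(tok)
            node = node[tok]
        ranked.append(tuple(prefix))
        remove_docid(trie, prefix)
    return ranked


def toy_scorer(query: str) -> Callable[[List[str], str], float]:
    """Stand-in for model logits: prefer tokens that appear in the query."""
    words = set(query.lower().split())

    def score(prefix: List[str], tok: str) -> float:
        if tok == END:
            return 0.5
        return 1.0 if tok.lower() in words else 0.0

    return score


if __name__ == "__main__":
    docs = [("Paris",), ("Paris", "Agreement"), ("Berlin",)]
    print(greedy_docid_list(toy_scorer("paris agreement"), docs, k=3))
```

In a real system, the `score` callback would presumably come from the model's next-token distribution conditioned on the query and the DocIDs generated so far; the word-overlap scorer here only keeps the example self-contained.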
