From Matching to Generation: A Survey on Generative Information Retrieval

Information retrieval (IR) systems are crucial tools for accessing information and are widely applied in scenarios such as search engines, question answering, and recommendation systems. Traditional IR methods, which return ranked lists of documents based on similarity matching, have been a reliable means of information acquisition and have dominated the IR field for years. With the advancement of pre-trained language models, generative information retrieval (GenIR) has emerged as a novel paradigm and has gained increasing attention in recent years. Current research in GenIR falls into two categories: generative document retrieval (GR) and reliable response generation. GR uses the parameters of a generative model to memorize documents, so that retrieval is performed by directly generating the identifiers of relevant documents, without explicit indexing. Reliable response generation, by contrast, employs language models to directly generate the information users seek. This breaks the limitations of traditional IR in document granularity and relevance matching, and offers more flexibility, efficiency, and creativity, thus better meeting practical needs. This paper systematically reviews the latest research progress in GenIR. We summarize advances in GR in terms of model training, document identifiers, incremental learning, downstream task adaptation, multi-modal GR, and generative recommendation, as well as progress in reliable response generation in terms of internal knowledge memorization, external knowledge augmentation, response generation with citations, and personal information assistants. We also review evaluation, challenges, and future prospects for GenIR systems. This review aims to offer a comprehensive reference for researchers in the GenIR field and to encourage further development in this area.
