As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have become part of our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. IR has evolved from its origins in term-based methods to its integration with advanced neural models. While neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, limited interpretability, and the generation of contextually plausible yet potentially inaccurate responses. Addressing these challenges requires combining traditional methods (such as term-based sparse retrieval methods with rapid response) with modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given how rapidly this line of research is moving, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we examine the confluence of LLMs and IR systems, covering crucial components such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions, such as search agents, within this expanding field.